Hacker News
Paperless: Scan and index paper documents (github.com)
132 points by DaGardner on Feb 11, 2016 | 59 comments



For me the problem is stability and future-proofness. Technology changes very quickly. If the maintainer loses interest, the software may rot away as the dependencies change, etc.

Important documents often need to be stored for 5-10-20 years. Why put everything in this shiny new software, when it may change in 1 or 2 years?

I think it's best to just put scanned pdfs in folders based on year and topic. Those can be easily and transparently backed up and searched.
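
A minimal sketch of that folder approach, assuming a `root/<year>/<topic>/` layout. The function names and the layout are hypothetical, chosen only for illustration, and the search here matches filenames only, not the text inside the PDFs:

```python
from pathlib import Path
import shutil


def archive_path(root: Path, year: int, topic: str, filename: str) -> Path:
    """Build root/<year>/<topic>/<filename>. The layout is an assumption."""
    return root / str(year) / topic / filename


def file_scan(src: Path, root: Path, year: int, topic: str) -> Path:
    """Copy a scanned PDF into its year/topic folder and return the new path."""
    dest = archive_path(root, year, topic, src.name)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)  # copy2 also preserves the file's timestamps
    return dest


def find_by_name(root: Path, keyword: str) -> list[Path]:
    """Case-insensitive filename search across the whole archive."""
    needle = keyword.lower()
    return sorted(p for p in root.rglob("*.pdf") if needle in p.name.lower())
```

Because the result is nothing but plain files and directories, any backup tool or OS-level search keeps working even if a script like this is abandoned.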

But on a timescale of a few months this software could be useful.


Funnily enough, I use a commercial app with the same name (Paperless) that does exactly that. It scans the documents, applies OCR and saves the PDFs in a folder that, in my case, is automatically synced with Dropbox and backed up to a local NAS.

It doesn't have search functionality (well, it does, but it's basically useless) but it lets me set categories and tags, which is more than enough for me.

There's an added issue with this kind of solution: in most cases you still need to keep the original. Having them scanned is great for record keeping and for communicating with your own accountant, but if there is a problem (tax audit, proving ownership, etc.) you'll have to produce the paper original.


Banks don't keep original paper checks any more, just a scanned copy. Fax is good enough for contracts and other legal paperwork. I was pretty sure that scanned documents are legally equivalent to originals. Is this not true?


AFAIK, depends on the country and legislation. In Ireland (and Spain) I've had to produce originals where the signature was clearly hand-written (they looked for pen pressure points, for example).

I've even had kafkaesque situations where I was asked for the original of a document that was only available online. In those cases I had to present a printed copy of the document and a signed document (from the bank in this case) saying that they didn't send hard copies/originals.


You are correct. (In the United States at least.)

> [...] maintain books and records by using an electronic storage system that either images their hardcopy (paper) books and records, or transfers their computerized books and records, to an electronic storage media, such as an optical disk.

https://www.irs.gov/pub/irs-tege/rp-97-22.pdf


Many (most?) banks, insurance companies have done away with physical paper and are using document management solutions of one kind or another. From a UK perspective a good starting point is BS 10008[1]. However, there is no guaranteed way that every company interprets the multitude of legal and compliance obligations. I work in this space.

[1] http://www.bsigroup.com/en-GB/bs-10008-electronic-informatio...


It was also the name I chose for a similar app: https://github.com/garnieretienne/paperless


> There's an added issue with this kind of solutions, in most cases you still need to keep the original.

Isn't the solution just banker's boxes in the attic to house the originals? I've never quite thought of that as an issue. Every quarter or so I move a stack of papers from the home office into a box I'll probably never have to retrieve anything from.


That's why I archive my document scans as one bit per pixel PNGs. It ends up being 20KB-50KB per page at 150 PPI. I figure that there will always be a way to get the pixels out of a PNG. PDF is a more complex and dynamic standard.
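
A rough back-of-the-envelope check of those numbers, assuming a US Letter page (8.5 x 11 in) and 1 KB = 1000 bytes. Neither assumption is stated above, so treat this as a sketch:

```python
def raw_bilevel_bytes(width_in: float, height_in: float, ppi: int) -> int:
    """Uncompressed size of a 1-bit-per-pixel page, each row padded to whole bytes."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    row_bytes = (width_px + 7) // 8  # 8 pixels per byte, rounded up
    return row_bytes * height_px


# Assumed page size: US Letter at the 150 PPI mentioned above.
raw = raw_bilevel_bytes(8.5, 11, 150)  # 1275 x 1650 pixels -> 264000 bytes

# Compression ratios implied by the quoted 20KB-50KB per page.
ratio_low = raw / 50_000   # about 5.3x
ratio_high = raw / 20_000  # about 13.2x
```

So under these assumptions the raw bitmap is about 264 KB per page, and the quoted file sizes correspond to PNG compressing it roughly 5x to 13x.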


That's true, but PNGs can't have a text layer for searchability.


I'm pretty sure text files can reside in the same folder as PNG files.


Yes, but not all documents are trivial to convert to a text file because the layout can be quite complex. A PDF file can have little bits of text floating anywhere and when you search inside the file, you can see it highlighted at its actual position.


I've had to work with PDF files before, and they're absolutely horrible. Precisely because of what you state: "little bits of text floating anywhere". Or something like disjoint, not-grouped, lines for table drawing, instead of a generic table with formatting, width/height, etc.

Though, I generally agree with you, PDF/A is quite a good way of storing documents for long-term. But, that doesn't mean that PNG files along with text files, even with x:y coordinates next to the pieces of text, aren't a feasible alternative.
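
To make the PNG-plus-text idea concrete, here is a minimal sketch assuming a made-up sidecar format: next to each `page.png` sits a `page.txt` where every line is `x<TAB>y<TAB>text`. The format and all names below are hypothetical, not an existing standard:

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    image: Path  # the PNG the text belongs to
    x: int       # pixel position of the text fragment
    y: int
    text: str


def parse_sidecar(sidecar: Path) -> Iterator[Hit]:
    """Yield one Hit per well-formed 'x<TAB>y<TAB>text' line; skip anything else."""
    image = sidecar.with_suffix(".png")
    for line in sidecar.read_text(encoding="utf-8").splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue
        try:
            x, y = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # coordinates were not integers
        yield Hit(image, x, y, parts[2])


def search(root: Path, query: str) -> list[Hit]:
    """Case-insensitive substring search over every sidecar under root."""
    needle = query.lower()
    return [
        hit
        for sidecar in sorted(root.rglob("*.txt"))
        for hit in parse_sidecar(sidecar)
        if needle in hit.text.lower()
    ]
```

Each hit carries the image path and coordinates, so a viewer could in principle highlight the match at its position on the page, similar to what a PDF text layer allows.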


Consider archiving them as djvu (http://djvu.sourceforge.net/). One bit per pixel djvu files at 150PPI will likely become 2-5KB pages instead.

Djvu also supports a text layer just like PDF.

Note that 150PPI is barely better than FAX, so your documents will likely look 'faxed' if you ever have to output hardcopies for some reason.


Djvu is patented. As a result it is very possible that it will never achieve enough critical mass to be suitable for long-term archiving.


People who worry about these things professionally generally would veer towards TIFF if PDF was insufficient. PDF/A does stuff like embed fonts and avoid proprietary compression & encryption, to avoid likely long term failure scenarios.

In the US, permanently retained documents like court records are kept in PDF. It will be around.


TIFF. Often expanded as "Thousands of Incompatible File Formats".


PDF is certainly more complex, but it's an ISO standard and even has an archival version: PDF/A.


FWIW, PNG is ISO/IEC 15948.


At least this is open source and you control your data vs whatever is today's popular mobile / SaaS app offering the same.


    Those can be easily and transparently backed up and searched.
I was recently asked if this is easy to do on Windows, especially the search part. What solution would you propose to someone who wants to index many PDF files already in such a folder structure?


I have been doing something similar for a couple of years.

My printer can scan to a shared drive on my home LAN, saving files as PDFs. These are then uploaded to Google Drive, where everything else happens automatically (e.g. if you search for something, it will find it in scanned PDFs automatically).

It's super-useful, especially since the mobile clients for Drive are rock solid. I can be on the phone to someone and pull up basically any document I've had since the 90s in a couple of seconds, for free. It's kinda fun being on the phone to a call centre and being able to pull up data quicker than they can. Tax returns are an absolute doddle when everything is paperless.

The only thing that is missing for me from Google Drive is something like a "Knowledge Graph" for my own documents. I can search by keyword or filename, sure, but next I'd like to get some "intelligence" like we're used to with Google Now, but for my scanned docs, e.g. "show me my bank statements with a payment to Amazon in the last 3 months".


So now people are trying to give Google more highly sensitive personal information?

/facepalm


Why is this a facepalm? Privacy advocacy is about giving people the option to decide what they share, not preventing them from sharing. I don't have a problem at all if someone voluntarily decides to share their private information with Google. Why does it matter to you what he does with his data?


It matters for the same reason I think US gun violence is a problem, even though I don't live there.

It matters for the same reason I think banning encryption in the UK is a problem, even though I don't live there.

It matters for the same reason I think the millions of people riding around on motor scooters here without any protective equipment/clothing, often against traffic in the parking/emergency lane is a problem.

The more data people give to an organisation like Google, the more power it gets.

The more power it gets, the more data it gets.

The more data it gets, ....


Going to need you to finish that train of thought. Data aggregation (or power concentration) isn't any more inherently machiavellian than any of the non sequiturs you rattled off.


We need legislation to put limits on what Google can do with said data. Until then, these tools should not be left on the table.


What printer do you use?


If you're interested in scanning many documents, I suggest you take a look at a ScanSnap from Fujitsu[1]. I've seen these used in a business environment (~30 machines) without any problem on Windows. The throughput is great and all employees were able to use the machine in less than 5 minutes.

[1]:http://www.fujitsu.com/ca/en/products/computing/peripheral/s...


If you don't want to buy a document scanner, just use your mobile phone for this.

I personally use Scanbot for this, it automatically recognizes, crops and OCRs documents (on the device) and stores them as PDF with the extracted text in the location of your choosing. Works well enough.


I've been using Evernote's Scannable for receipts and single pages. I had been using a scanner w/ ADF, but it was slow and I never automated it.

Scannable works really fast and Evernote indexes PDFs.

If only Evernote's editor didn't make me want to switch away every time I use it...


I use Google Docs for it. You can upload scanned documents to Google Docs. Documents are automatically OCRed, you can search by keywords and you can still access the original image.

Disclaimer: I work at Google, although not on the Google Docs team.


Last time I checked it's much cheaper to get a document scanner with ADF built together with a laser printer than to buy one standalone. I was quite surprised.


I can't really back this up with empirical evidence, but in my experience the ADFs on consumer all-in-ones tend to be a bit crap compared to getting a standalone one.


Nice combination of technologies to solve a problem -- could be very useful for a business that needs to be able to archive and access paper records.

But for a household -- there are very few documents you need to keep long term. Better to just keep those in a fireproof file box, and shred and discard everything else rather than devote any resources or mental energy to keeping them around in either paper or digital form.


>few documents you need to keep long term

I disagree. While I'm a huge fan of purging, there are many, many cases where you need/want documents.

Theft/fire/casualty: old receipts prove ownership and value.

Maintenance: who worked on the furnace 4 yrs ago?

Warranty: our windows have a 20 year warranty (and we're using it!)

Basis for home improvements: when you sell your home, if you can document improvements, you can raise your basis and lower your capital gains.

Repair: where's the part number & diagram for the faucet that's leaking?

School records for your children.

Etc.


I would add university notes. I was really clumsy with mine but my gf has all her notes for all the classes from uni. It takes up two large shelves, it's freaking massive. We want to scan it but it's a huge task.


For what it's worth, I scan all of those documents you mentioned. The only originals I have in my fire chest are: the deed to my house (even though I can get a certified copy from the county), birth certificates, immigration paperwork, vehicle title, passports, and irreplaceable documents of sentimental or practical value. Even these are scanned as a backup.

All other items are either delivered electronically, like bills, or scanned and shredded upon arrival, like insurance policies. (Sometimes I take a picture of manual covers or packaging so I can have part and serial numbers.) I keep copies of the files in various encrypted places, including a USB stick that goes with me.


I'm with you on a few of those. For insurance purposes you only need to worry about documenting items that are unusually valuable. School records (e.g. report cards, etc) I don't keep, and have never needed. A major home improvement expense I would probably keep, though capital gains on a primary home sale are generally exempt up to $250,000(?).

But I was really more thinking about everyday utility bills, other statements and invoices -- I just trash all that stuff as soon as it's paid. I have better things to do than organize papers that I will never look at again.


Only a few of your examples require the original paper at hand. Having them scanned and searchable makes your life a lot easier.


These are the words of someone who has never applied for a mortgage.


How so? I have a mortgage and am doing a refinance right now. In both cases, I delivered the documents requested by the lender electronically. Bank statements, pay stubs, insurance information, and so on were all uploaded or emailed. The only paper generated has been the stack of forms to sign and I scanned and shredded those at the conclusion of the transaction (except the deed).


Naivety is charming.


It would be nice if this joined forces with Camlistore to hurry up the Scanning Cabinet replacement :-)


Mathieu's got his code up for review!

https://camlistore-review.googlesource.com/#/c/5416/


I bought a high-speed scanner with OCR a few years ago. MacOS automatically indexes PDFs, so I can easily search through my scanned documents in Finder.

A magic folder system, like Dropbox or Syncplicity, makes sure that the pdfs are safely backed up for me.


You can use Docady's scanner, which also does OCR and recognizes the document's content. It then stores your documents and encrypts them. At the moment it's available on iOS, but it should be available soon on Android too.

Demo: https://www.youtube.com/watch?v=cN_Zw6xoUaw

App: https://itunes.apple.com/US/app/id921250909?mt=8

(Full disclosure: I work at Docady and am part of its team)


I've found Adobe Acrobat X to be great for OCR and indexing of PDFs. Nothing else I've used comes close to what it can recognise.


There's lots I don't like about Acrobat X (and now DC), but ClearScan is an awesome format for scanning and retaining PDF documents. I wish (though don't expect) Adobe would open source it.


Have you used ABBYY software? Just interested in a comparison.


It seems somewhat ironic to me that someone built this whole paper to ocr system, and then says "hey use it with a scanner like X", which has OCR capabilities (producing searchable PDFs) built in.


It's not built in to the scanner. It's done by the software that comes with the scanner for Windows and OS X.


Thanks for the clarification.


OP, great job. I have been trying to solve this very same problem for over a year now, and have a business plan based on the same. Is there a way I can PM you and get some clarifications? Thanks.


Any chance of a wiki to group-collaborate on getting different scanners to work with this?

I have an HP Envy I'd like to glue to the cloud.


No Dockerfile? I have a Dockerfile for handling Web cam images sent by FTP: https://github.com/kaihendry/camftp2web/blob/master/Dockerfi...


What about just taking a picture?


Take a picture with your stock camera, or use an app that will crop, apply OCR and convert your image to a document format? The former lacks features without relying on another program. The latter is only good for people who infrequently need something scanned. Most people can probably get by using a phone app, but this is for people with lots of paper documents.


That said, even if you have a dedicated document scanner, you should spend the $5 or whatever to get Scanbot Pro for your phone. Makes it so friggin easy to get a quick scan of a document if you ever need it. And it does on-phone OCR (far as I can tell the results are pretty good!) with the option to upload to Dropbox or a similar service.

I mention Scanbot specifically because I bought and tried about 5 or 6 of the available document scanning apps on iOS before settling on it. It really does do literally everything I want in a document scanning app short of tagging the documents (which would, of course, be a file-system specific thing).



