For me the problem is stability and future-proofness. Technology changes very quickly. If the maintainer loses interest, the software may rot away as the dependencies change, etc.

Important documents often need to be stored for 5-10-20 years. Why put everything in this shiny new software, when it may change in 1 or 2 years?

I think it's best to just put scanned PDFs in folders based on year and topic. Those can be easily and transparently backed up and searched.
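
For what it's worth, the year/topic layout is simple enough to script. Below is a minimal Python sketch, not a recommendation of any particular tool. The `<root>/<year>/<topic>/` layout and the helper names `file_scan` and `find_scans` are assumptions made up for this example.

```python
import shutil
import tempfile
from pathlib import Path


def file_scan(root: Path, scan: Path, year: int, topic: str) -> Path:
    """Copy a scanned PDF into root/<year>/<topic>/ and return the new path."""
    target_dir = root / str(year) / topic
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / scan.name
    shutil.copy2(scan, target)  # copy2 also preserves file timestamps
    return target


def find_scans(root: Path, year: str = "*", topic: str = "*") -> list[Path]:
    """List PDFs by year and/or topic using only the folder names."""
    return sorted(root.glob(f"{year}/{topic}/*.pdf"))


if __name__ == "__main__":
    # Demo in a throwaway directory; the file names are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "archive"
        scan = Path(tmp) / "invoice.pdf"
        scan.write_bytes(b"%PDF-1.4\n")  # stand-in for a real scan
        file_scan(archive, scan, 2015, "taxes")
        for path in find_scans(archive, topic="taxes"):
            print(path.relative_to(archive))
```

Because the result is just a directory tree, any backup tool that copies folders will handle it, which is the point of the approach.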

But on a timescale of a few months, this software could be useful.

Funnily enough, I use a commercial app with the same name (Paperless) that does exactly that. It scans the documents, applies OCR, and saves the PDFs in a folder that, in my case, is automatically synced with Dropbox and backed up to a local NAS.

It doesn't have search functionality (well, it does, but it's basically useless), but it lets me set categories and tags, which is more than enough for me.

There's an added issue with this kind of solution: in most cases you still need to keep the originals. Having them scanned is great for record keeping and for communicating with your own accountant, but if there is a problem (tax audit, proving ownership, etc.) you'll have to produce the paper original.

Banks don't keep original paper checks any more, just a scanned copy. Fax is good enough for contracts and other legal paperwork. I was pretty sure that scanned documents are legally equivalent to originals. Is this not true?

AFAIK, it depends on the country and legislation. In Ireland (and Spain) I've had to produce originals where the signature was clearly hand-written (they looked for pen pressure points, for example).

I've even had Kafkaesque situations where I was asked for the original of a document that was only available online. In those cases I had to present a printed copy of the document and a signed document (from the bank in this case) saying that they didn't send hard copies/originals.

You are correct. (In the United States at least.)

> [...] maintain books and records by using an electronic storage system that either images their hardcopy (paper) books and records, or transfers their computerized books and records, to an electronic storage media, such as an optical disk.


Many (most?) banks and insurance companies have done away with physical paper and are using document management solutions of one kind or another. From a UK perspective, a good starting point is BS 10008[1]. However, there is no guarantee that every company interprets the multitude of legal and compliance obligations the same way. I work in this space.

[1] http://www.bsigroup.com/en-GB/bs-10008-electronic-informatio...

It was also the name I chose for a similar app: https://github.com/garnieretienne/paperless

Isn't the solution just banker's boxes in the attic to house the originals? I've never quite thought of that as an issue. Every quarter or so I move a stack of papers from the home office into a box I'll probably never have to retrieve anything from.

That's why I archive my document scans as one bit per pixel PNGs. It ends up being 20KB-50KB per page at 150 PPI. I figure that there will always be a way to get the pixels out of a PNG. PDF is a more complex and dynamic standard.
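
As a back-of-the-envelope check on those numbers, here is a small Python sketch that computes the uncompressed size of a one-bit-per-pixel page. It assumes a US Letter page (8.5 x 11 inches), which the comment above does not specify, and the function name `raw_bilevel_bytes` is made up for this example.

```python
def raw_bilevel_bytes(width_in: float, height_in: float, ppi: int) -> int:
    """Uncompressed size in bytes of a 1-bit-per-pixel page.

    Each row is padded to a whole number of bytes, as is usual for
    packed bitmaps.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    bytes_per_row = (width_px + 7) // 8  # 8 pixels per byte, rounded up
    return bytes_per_row * height_px


if __name__ == "__main__":
    # Assumption: US Letter at 150 PPI, i.e. 1275 x 1650 pixels.
    raw = raw_bilevel_bytes(8.5, 11, 150)
    print(raw)  # 264000
    # Compression ratios implied by 20 KB and 50 KB files (taking 1 KB = 1000 bytes).
    print(f"{raw / 20_000:.1f}x to {raw / 50_000:.1f}x")
```

Under that assumption, the raw bitmap is about 264 KB, so 20KB-50KB per page corresponds to a compression ratio of roughly 5x to 13x.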

That's true, but PNGs can't have a text layer for searchability.

I'm pretty sure text files can reside in the same folder as PNG files.

Yes, but not all documents are trivial to convert to a text file because the layout can be quite complex. A PDF file can have little bits of text floating anywhere and when you search inside the file, you can see it highlighted at its actual position.

I've had to work with PDF files before, and they're absolutely horrible, precisely because of what you state: "little bits of text floating anywhere". Or something like disjoint, ungrouped lines for table drawing, instead of a generic table with formatting, width/height, etc.

Though I generally agree with you: PDF/A is quite a good way of storing documents for the long term. But that doesn't mean that PNG files along with text files, even with x:y coordinates next to the pieces of text, aren't a feasible alternative.
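
To make the "text files with x:y coordinates" idea concrete, here is a minimal Python sketch of a sidecar file stored next to each PNG. The JSON layout (`image`, `words`, `text`, `x`, `y`) and the helper names are purely hypothetical, invented for this example; in practice the words and positions would have to come from whatever OCR tool you use.

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(png_path: Path, words: list[dict]) -> Path:
    """Write page.json next to page.png, holding each word and its pixel position."""
    sidecar = png_path.with_suffix(".json")
    payload = {"image": png_path.name, "words": words}
    sidecar.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return sidecar


def search_sidecars(folder: Path, term: str) -> list[tuple[str, int, int]]:
    """Return (image name, x, y) for every word containing term, case-insensitively."""
    hits = []
    for sidecar in sorted(folder.glob("*.json")):
        data = json.loads(sidecar.read_text(encoding="utf-8"))
        for word in data["words"]:
            if term.lower() in word["text"].lower():
                hits.append((data["image"], word["x"], word["y"]))
    return hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Hypothetical OCR output for one page; coordinates are in pixels.
        write_sidecar(folder / "page-001.png", [
            {"text": "Invoice", "x": 120, "y": 80},
            {"text": "Total", "x": 900, "y": 1500},
        ])
        print(search_sidecars(folder, "invoice"))
```

A viewer could use the returned coordinates to highlight a hit on the image, which is roughly what the PDF text layer provides, at the cost of maintaining a home-grown format.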

Consider archiving them as DjVu (http://djvu.sourceforge.net/). One-bit-per-pixel DjVu files at 150PPI will likely come to 2-5KB per page instead.

DjVu also supports a text layer just like PDF.

Note that 150PPI is barely better than FAX, so your documents will likely look 'faxed' if you ever have to output hardcopies for some reason.

DjVu is patented. As a result, it is very possible that it will never achieve enough critical mass to be suitable for long-term archiving.

People who worry about these things professionally would generally veer towards TIFF if PDF were insufficient. PDF/A does stuff like embed fonts and avoid proprietary compression & encryption, to avoid likely long-term failure scenarios.

In the US, permanently retained documents like court records are kept in PDF. It will be around.

TIFF. Often expanded as "Thousands of Incompatible File Formats".

PDF is certainly more complex, but it's an ISO standard and even has an archival version: PDF/A.

FWIW, PNG is ISO/IEC 15948.

At least this is open source and you control your data vs whatever is today's popular mobile / SaaS app offering the same.

> Those can be easily and transparently backed up and searched.

I was recently asked if this is easy to do on Windows, especially the search part. What solution would you propose to someone who wants to index many PDF files already in such a folder structure?

