
Show HN: Open Paperless – Scan, index, and archive paper documents - zhoubear
https://github.com/zhoubear/open-paperless
======
Spearchucker
This is arguably a lot more than I need. I'm a hoarder in that I have every
email I've ever sent or received (bar junkmail), and every piece of paper I've
ever received.

Most of my paper is now scanned - I think I have two boxes left in my garden
shed. I don't bother with OCR because search doesn't help me when I don't know
what to search for (e.g. invoice for a jumper I bought in 2010 - fashion
labels rarely call their jumpers jumper).

And so I rely on metadata. There's not much out there in terms of open-source
tagging software, and even less in terms of an open tagging approach. I ended
up with tagspaces, which is a web app packaged up as a native app. The
approach to tagging is good (tags appended to file name), but the app is
abysmally poor. Slow - waiting up to 30 seconds for a pop-up menu to appear.
It assumes tag-based searches work in only one way.
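
To make the "tags appended to file name" idea concrete, here is a minimal sketch. It assumes a bracketed convention of the form `title[tag1 tag2].ext`; that exact format, and the helper names `split_tags`, `add_tags` and `find_by_tags`, are illustrative assumptions rather than a description of how any particular app stores its tags.

```python
import re
from pathlib import PurePath

# Assumed convention: "<title>[tag1 tag2].<ext>". Adjust the pattern to your own scheme.
TAG_BLOCK = re.compile(r"\[([^\]]*)\]\s*$")


def split_tags(filename):
    """Return (title, tags) parsed from a file name."""
    stem = PurePath(filename).stem
    match = TAG_BLOCK.search(stem)
    if not match:
        return stem.strip(), []
    return stem[:match.start()].strip(), match.group(1).split()


def add_tags(filename, new_tags):
    """Return a new file name with new_tags merged into the tag block."""
    title, tags = split_tags(filename)
    merged = list(tags)
    for tag in new_tags:
        if tag not in merged:
            merged.append(tag)
    return f"{title}[{' '.join(merged)}]{PurePath(filename).suffix}"


def find_by_tags(filenames, wanted):
    """Keep only the names that carry every tag in wanted (an AND search)."""
    wanted = set(wanted)
    return [name for name in filenames if wanted <= set(split_tags(name)[1])]
```

Because the tags live in the name itself, they survive copying, backups and a change of tool; the cost is that every retagging is a rename.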

The intent is to write some native apps to solve my biggest problems. For now
I'm still trying to clear the backlog of un-scanned paper docs (not going to
get this done for me, because privacy). I tag important stuff, like employment
contracts, mortgage agreements, passports and birth certificates...

Hope to have everything done by the time I cash in my chips. Might make for a
useful dataset for someone somewhere some day.

~~~
ishi
A few years ago I was involved with a startup that built a document management
system for consumers, and we actually got pretty good results with OCR +
automatic tagging based on a very simple database that maps keywords to tags.

Let's say you want to auto-tag bills and other documents from your ISP. So you
add the ISP's name, phone number, website address etc. into the database - any
uniquely-identifying keywords that typically appear on the documents that they
send. Now any document that contains these keywords will get tagged as "ISP",
making it very easy to find in the future.

Even if the OCR quality isn't perfect, at least one of these keywords will
most likely get matched.

Another example - you could add the names of your family members as keywords,
making it easy to find all documents related to Jenny or Susan.

You could argue that full-text search would achieve the same result, but
uploading documents into the system and having them auto-tagged as "ISP",
"car-payments", "Walmart", "Susan" and so on feels a little bit like magic, as
if the system is actively helping you organize your papers.

The keyword approach is also very easy to understand and tweak, unlike more
rigorous but opaque methods of document clustering (such as those built on
tf-idf vectors).
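
A minimal sketch of the keyword-to-tag mapping described above. The database contents (`KEYWORD_TAGS`), the sample company name and phone number, and the function names are all made up for illustration; this is not the startup's actual implementation.

```python
# Hypothetical keyword database: each tag lists strings that identify it.
KEYWORD_TAGS = {
    "ISP": ["example telecom", "0800 123 456", "www.example-telecom.test"],
    "Susan": ["susan"],
    "car-payments": ["auto finance"],
}


def normalize(text):
    """Lower-case and collapse whitespace so OCR line breaks don't block matches."""
    return " ".join(text.lower().split())


def auto_tag(ocr_text, keyword_tags=KEYWORD_TAGS):
    """Return the sorted tags that have at least one keyword in the OCR text."""
    haystack = normalize(ocr_text)
    return sorted(
        tag
        for tag, keywords in keyword_tags.items()
        if any(normalize(keyword) in haystack for keyword in keywords)
    )
```

Listing several keywords per tag is what makes this tolerant of imperfect OCR: if the company name is misread, the phone number or website may still match. Plain substring matching is deliberately crude (for example, "susan" also matches "Susanna"), so a real system would likely want word-boundary checks.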

~~~
myaso
Out of curiosity what is the state of the art today for extracting text or
other data from scanned documents (forms, legal docs, receipts, etc) ?

~~~
matt_the_bass
I don't have an exact answer but can tell you that Expensify still resorts to
human parsing sometimes. How often "sometimes" is, I have no idea. I would
guess a lot.

------
theomega
I want to show an alternative approach to managing your documents:

Store them in your IMAP mailbox, either on a separate account or in a dedicated
subfolder.

I wrote some small python scripts [1] which allow you to:

- Add an email with the PDF attached to your document collection. The script
  supports adding a subject and adding tags to it.
- Go over all the emails and run an OCR (tesseract) on them: Attach the OCR
  result together with the pdf to the email.

Big advantage:

- Search on IMAP is a solved problem
- Clients for every operating system in the world, including web, mobile
- Super simple backup and restore

Of course, very geeky, nothing for your parents, but maybe something for
you?

[1]:
[https://github.com/theomega/IMAP_DMS](https://github.com/theomega/IMAP_DMS)
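
For readers who want to see the general shape of the idea, here is a minimal sketch using only the Python standard library. It is not the code from the linked repository: the `X-Document-Tags` header, the placeholder addresses, the `Documents` folder name and both function names are assumptions made for illustration.

```python
import imaplib
import time
from email.message import EmailMessage


def build_document_mail(subject, tags, pdf_bytes, pdf_name, ocr_text=None):
    """Wrap one scanned PDF (and optional OCR text) in an email message."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "archive@example.invalid"   # placeholder address
    msg["To"] = "archive@example.invalid"     # placeholder address
    msg["X-Document-Tags"] = ", ".join(tags)  # hypothetical header name
    # The body carries the OCR text so ordinary mail search can find it.
    msg.set_content(ocr_text or "(no OCR text yet)")
    msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                       filename=pdf_name)
    return msg


def upload(msg, host, user, password, mailbox="Documents"):
    """Append the message to an IMAP folder (needs a reachable server)."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.append(mailbox, "", imaplib.Time2Internaldate(time.time()),
                    msg.as_bytes())
```

Putting the OCR text in the message body rather than in a second attachment is a design choice for this sketch: server-side body search then works without the client having to understand PDFs.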

~~~
JohnStrange
Don't you have to run your own IMAP server for that to work?

Although my mail provider is fairly generous about storage space, it's not
unlimited.

~~~
theomega
Depends on two things: Your space and your privacy requirements. Google Mail
works for example if you are willing to trust Google. A lot of email providers
offer you a lot of space.

------
jopsen
Question: why bother organizing papers?

I just throw everything in a box, if I ever need it again later it'll take a
long time to find.. but I rarely need to find a document again.

Complexity of archiving a document is O(1) with a very small constant.
Complexity of retrieval is O(N) for a large N.

But I have few retrievals in my system, so why pay a higher per document cost?

~~~
tombrossman
> Question: why bother organizing papers?

Because being organised makes you more effective. With your 'throw it all in a
box' system, you have a high barrier to finding documents in the future and
this discourages you from doing so. However, with a more organised approach
you are more likely to retrieve specific documents.

One example: Some mid-priced electronic device breaks a few months after you
buy it. You might weigh digging through all the paperwork versus shrugging
your shoulders and throwing it away. I would go straight to the warranty
document and also look at my credit card issuer's warranty/returns policies (if
any), and I would return the item for a replacement or refund. No biggie, only
a few minutes work and I as a consumer prevail in exercising my rights.

Sounds boring but I believe it is definitely worth making the effort.

~~~
jcelerier
> With your 'throw it all in a box' system, you have a high barrier to finding
> documents in the future and this discourages you from doing so.

In fifteen years of keeping my mail, I have maybe once or twice had to go back
more than a month or two.

~~~
purerandomness
Then that system obviously doesn't solve one of your pains.

I have to dig out older documents almost daily.

~~~
matt_the_bass
What types of things are you looking for so frequently? Maybe you could
organize a subset of papers?

------
pingec
Does anyone know any similar free/open products for archiving documents,
tracking etc.?

What I am after is a system like expensive solutions have in some companies
where the mailbox department prints (or has preprinted) labels with unique bar
codes, for any incoming mail, they open it, stick a label on it, scan it with
the label on it and then physically deliver it. Some departments also input
recipient and sender details, add tags etc. So in the end they have a
searchable database by persons involved, content type, tags and also all
documents (physical and digital) have a referenceable id that can be used for
various purposes.

------
prashnts
I've been using iOS and Mac's native notes app to do that. In my opinion what
these solutions lack is an integration between both note-taking (I sometimes
like to write a few sentences relevant to a document, and I'd like to have it
shown right next to it) while also letting you have the individual documents
available in PDF or whatever if you need. Notes app does it perfectly now
after iOS 11.1 and High Sierra.

An example is this screenshot from my notes
[https://imgur.com/a/xuZqW](https://imgur.com/a/xuZqW)

~~~
mark_l_watson
I used to use Notes but stopped after trying to back up my notes. For me,
exporting one note at a time to pdf is not good enough, and finding the opaque
binary file in ~/Library does not help because it is not a standard file
format.

I switched to using Notes in Fastmail.

~~~
davidrupp
I've been using
[http://writeapp.net/notesexporter/](http://writeapp.net/notesexporter/) for
Mac's Notes app; happy with it so far.

------
pw0nka
Looks great. Love the idea behind it, but...

There is at least one country (mine - Switzerland) which is not able to use
software like yours. The problems are the current laws that force people and
organizations to store physical copies of the documents (for several years).
Electronic documents have no value in front of the law, which is why we have
no choice but to do all of that offline, manually.

I've tried many archiving solutions, but none of them saved any bit of time.
The one single, missing feature was an automatism to print a serial code (the
electronic document ID) back on the original document. This way you could just
scan it, print it, put it in a large box where you sort it by its ID - that
simple. And this would even work if you would use spacers to split the
documents on the scanning process.

~~~
jopsen
I have always archived my documents by throwing them in an unsorted black box.

If someone really needs me to retrieve an old document, it'll take forever to
find, but why would I want to pay sorting costs upfront?

~~~
copperx
Because you often need documents on stressful occasions such as after the death
of a family member, after an accident, after losing your job, after an IRS
audit. You really want to be going over n documents with the possibility of
missing one or more during such times?

------
y4mi
a nontrivial name conflict with Paperless
([https://github.com/danielquinn/paperless](https://github.com/danielquinn/paperless))
...

------
tjoff
I don't know what Mayan EDMS is and all this readme does is say what it is
in relation to Mayan EDMS. Extremely frustrating.

~~~
Fnoord
According to the documentation it is Ubuntu only, as it requires Ubuntu 16.10
or later. What about other Linux distributions? No mention of the other 2
popular desktop OSes, Windows and macOS?

------
carwyn
There's also this paperless
[https://github.com/danielquinn/paperless](https://github.com/danielquinn/paperless)

------
curioussavage
Any good open source desktop software with linux support to do this? I don't
see why I would personally want a web app for this.

~~~
joelhaasnoot
It's a little clunky but here's the one I found best that just worked on
Ubuntu: [http://gscan2pdf.sourceforge.net/](http://gscan2pdf.sourceforge.net/)
. It can combine some of the best tools for OCR/cleanup/etc.

My main gripe is that I have a document feeder and manually selecting pages
with shift to combine in to a single document and clicking "Save as" is far
too much of a hassle. There needs to be a better flow for that.

~~~
coaxial
I wrote a collection of bash scripts for that.
[https://github.com/coaxial/insaned-config](https://github.com/coaxial/insaned-config)

It was initially to use with insaned, but I later came up with a script to tie
it all together (scan.sh) because it's faster than jamming the scan button
waiting for insaned to register. And with the script, I can queue commands
provided I'm fast enough to swap the physical pages in the flatbed scanner.

It also uses the excellent textcleaner imagemagick script to clean up the
scans and make them more ocr friendly.

The readme isn't totally up to date, parallel isn't required anymore, and
there is no mention of the scan.sh script. But when you run it, it prompts for
commands. You might need to edit the scripts to set your own output
directories and textcleaner location.

------
karinato
For those wondering about the relationship between Mayan EDMS, Paperless and
Open Paperless here is a story line summary of the saga.

Roberto Rosario (the creator of Mayan) is a very well known name in the
Django, Python, document management, maker, hacking, open health and open
source in government circles.

- [https://speakerdeck.com/siloraptor](https://speakerdeck.com/siloraptor)
- [https://en.wikipedia.org/wiki/Roberto_Rosario](https://en.wikipedia.org/wiki/Roberto_Rosario)
- [https://www.pycon.it/conference/p/roberto-rosario](https://www.pycon.it/conference/p/roberto-rosario)
- [http://pyvideo.org/djangocon-us-2014/liberation-and-moderniz...](http://pyvideo.org/djangocon-us-2014/liberation-and-modernization-of-government-legacy.html)
- [https://cpucadviceletters.org/login/?next=/](https://cpucadviceletters.org/login/?next=/)
- [https://twit.tv/shows/floss-weekly/episodes/253](https://twit.tv/shows/floss-weekly/episodes/253)
- [https://en.wikipedia.org/wiki/Mayan_(software)](https://en.wikipedia.org/wiki/Mayan_\(software\))
- [https://www.youtube.com/watch?v=rubzEAojf-k](https://www.youtube.com/watch?v=rubzEAojf-k)

Mayan EDMS was initially released on February 3, 2011 (Wikipedia and git log).
In June 2015, Roberto gave a workshop at DjangoCon titled "From zero to
paperless with Mayan EDMS"
([https://archive.is/FDpYS](https://archive.is/FDpYS)). Daniel Quinn (the
creator of Paperless) also attended and presented at the same DjangoCon event
([https://vimeo.com/135907408](https://vimeo.com/135907408)) and 6 months
later after working on it for several months (Daniel's own words), he released
Paperless on December 20, 2015
([https://github.com/danielquinn/paperless/commits/master?afte...](https://github.com/danielquinn/paperless/commits/master?after=af4623e60563f5e4328e87ec8027f79804f8d08a+559)).
By January 24, 2016, Paperless had "exploded in popularity"
([https://twitter.com/danielagquinn/status/691242822431830016](https://twitter.com/danielagquinn/status/691242822431830016)).

Both projects used Python, Django, same Django 3rd party apps like DjangoSuit,
same document consumer model, same OCR engine, REST API, among other things.
On the surface it appeared that Paperless was a copy of Mayan EDMS concepts
and implementations without giving credit or mention. Many additions were
planned for Paperless that were features and implementations already in Mayan
([https://www.reddit.com/r/selfhosted/comments/44mh88/scan_ind...](https://www.reddit.com/r/selfhosted/comments/44mh88/scan_index_and_archive_all_of_your_paper_documents/)).

A separate point of contention was that the name "Paperless" had been in use
by other projects much earlier than Daniel's Paperless
([https://github.com/search?utf8=%E2%9C%93&q=paperless&type=](https://github.com/search?utf8=%E2%9C%93&q=paperless&type=)).
Since there is no trademark on the name or description, other projects
appeared with the same name and description
([https://github.com/lrnt/paperless](https://github.com/lrnt/paperless)).

On March 15, 2016, Daniel presented Paperless at CodeNode
([https://skillsmatter.com/skillscasts/7843-intro-to-paperless](https://skillsmatter.com/skillscasts/7843-intro-to-paperless)).

It was Daniel's February 27, 2016 tweet suggesting to be paid to work on
Paperless that sparked the animosity between the users of the two projects
([https://twitter.com/danielagquinn/status/703629488932970500](https://twitter.com/danielagquinn/status/703629488932970500)).

Many heated debates ensued. Even then, the main critique of Paperless remained
technical: its lack of maturity and implementation was described by one Reddit
user as: "I've looked into paperless and it currently lacks a lot of...nearly
well everything. Maybe in a year or two"
([https://www.reddit.com/r/linux/comments/6m9evn/want_to_go_pa...](https://www.reddit.com/r/linux/comments/6m9evn/want_to_go_paperless_looking_for_dms/dk1cjz0/))

On April 9, 2016, Daniel added a reference to Mayan to the documentation of
Paperless
([https://github.com/danielquinn/paperless/commit/674d54ec3878...](https://github.com/danielquinn/paperless/commit/674d54ec38783b02350c1371bdf0f412dd765ef0#diff-88b99bb28683bd5b7e3a204826ead112)).

On April 17, 2016, Daniel posted on his old twitter account: "It looks like my
idea for Paperless wasn't all that unique. This other project uses a lot of
the same tools: [http://www.mayan-edms.com](http://www.mayan-edms.com)"
([https://twitter.com/danielagquinn/status/721726208606646272](https://twitter.com/danielagquinn/status/721726208606646272)).

On April 14, 2017, Daniel Quinn posted in his blog a summary of his
experiences at DjangoCon Europe 2017 where he mentions meeting Roberto in
person. He describes Roberto as a "rival geek" in what appears to be jest and
uses positive adjectives to describe Roberto in the rest of the post.
([https://danielquinn.org/blog/djangocon-2017/](https://danielquinn.org/blog/djangocon-2017/))

On April 16, 2017 Daniel posted a tweet mentioning the popularity of Paperless
([https://twitter.com/danielagquinn/status/853701257051205632](https://twitter.com/danielagquinn/status/853701257051205632)).

The last release of Paperless was made on Sep 9, 2017.

On Oct 18, 2017 Daniel posted: "I changed my Twitter name! This isn't me any
more, so if you're looking for me, you should keep head over to
@danielagquinn."
([https://twitter.com/searchingfortao/status/92077862371561062...](https://twitter.com/searchingfortao/status/920778623715610624)).
Only 7 commits have been made to Paperless since, with the last commit
happening on November 5, 2017.

On December 18, 2017 a user named "zhoubear" announced on Reddit's selfhosted
"Open Paperless: Scan, index, and archive all of your paper documents"
([https://www.reddit.com/r/selfhosted/comments/7kjocg/scan_ind...](https://www.reddit.com/r/selfhosted/comments/7kjocg/scan_index_and_archive_all_of_your_paper_documents/)).
It turned out that Open Paperless was a forked Mayan EDMS with cosmetic
changes but with copyrights changed and no attribution to Mayan EDMS. After
much heated debate, copyrights and attributions were restored and the
project's description has been updated to show that it is a new front end for
Mayan among other usability changes meant for home users.

In 4 days, Open Paperless has surpassed Mayan EDMS in popularity on Github.

No posts or comments from Roberto can be found in reference to Paperless or
Open Paperless.

[https://twitter.com/search?q=paperless%20from%3Asearchingfor...](https://twitter.com/search?q=paperless%20from%3Asearchingfortao&src=typd)

~~~
kerridge0
Mayan isn't hosted on GitHub so that may explain the difference in popularity.

------
ikawe
Let's put this in a room with The Screenless Office
([https://news.ycombinator.com/item?id=15960056](https://news.ycombinator.com/item?id=15960056))
and see what happens.

~~~
beamatronic
Is that the one where you scan a barcode and they mail you a printout of a web
page ?

------
SomewhatLikely
Something I've wanted that might be possible is software that takes in a video
of me flipping the pages of a notebook and converts that to a PDF of the
notebook.

~~~
mdaniel
I regret that I can't immediately find the video which discussed it, but this
gets in the ballpark of what I saw:
[https://www.researchgate.net/publication/271462470_OCR_from_...](https://www.researchgate.net/publication/271462470_OCR_from_Video_Stream_of_Book_Flipping)

IIRC, it wasn't ~vaporware~ researchware, but nor was it "clone this repo,
away you go"

------
bob_theslob646
Please correct me if I am wrong, but this looks like you have to "name" each
page. I would also want to see how accurate the ocr is. Historically, ocr on
handwriting has been a problem unless the data is perfectly formatted. I
guess the case is just to get enough accuracy so that you can look for or at
the image of that page with the indexed search term you were looking for.

------
mickael-kerjean
Well done! Will definitely give it a try back home!

------
mauritzio
Maybe it would be better to "archive" on good paper (encoded). Can not imagine
a 1000 year old magnetic device... ;)

------
gravypod
Will this automatically center and apply perspective transforms to pictures
taken with phone cameras?

~~~
zhoubear
Not at the moment. I'm guessing this information is available in the EXIF
properties?

~~~
joshvm
It's a little bit more tricky than that. What the EXIF _might_ tell you is the
camera calibration parameters like focal length, distortion, perspective
center, etc. That can be used to fix systematic errors in images like
pincushion/barrel distortion.

To unwarp photos that were taken at odd angles you need to do some image
processing. The mathematics aren't particularly difficult, it's a homography
transform in most cases (rectangles). The problem is robustly detecting the
page.
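
As a rough illustration of the "not particularly difficult" part, here is a dependency-free sketch that computes a homography from four corner correspondences and applies it to a point. It assumes the four page corners have already been found, which is exactly the hard detection step this sketch does not attempt; the function names are illustrative.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def find_homography(src, dst):
    """Return the 3x3 matrix mapping four src points onto four dst points.

    The bottom-right entry is fixed to 1, the usual normalisation.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, point):
    """Apply the homography to one (x, y) point, dividing by the scale w."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

In practice you would more likely reach for an image library; OpenCV, for instance, provides `cv2.getPerspectiveTransform` and `cv2.warpPerspective` for the same job, including resampling the whole image.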

Dropbox has some nice write-ups on this:
[https://blogs.dropbox.com/tech/2016/08/fast-document-rectifi...](https://blogs.dropbox.com/tech/2016/08/fast-document-rectification-and-enhancement/)

~~~
zhoubear
Thanks for the link. That blew my mind! I wish it could be added; my phone
would replace my scanner instantly.

~~~
gravypod
Also see this for more implementation details:
[https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobi...](https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/)

This could be done server side if you're already doing ocr.

~~~
zhoubear
Wow this looks awesome, thank you!

------
rootsudo
Okay, wow, this is cool.

------
EGreg
I have a question

Is there a service anyone knows about which will print your email and send it
with tracking of receipt or signature, so you can prove what was physically
sent?

Or you mail it to them and they open your mail, scan it and forward it on with
signature required, with your address as the return address?

Because right now you can only prove that the ENVELOPE was received, not what
was in it.

~~~
craftyguy
you can just ask your recipient to send you some uuid on the letter after they
receive it. i have no idea what problem you are trying to solve though.

~~~
y4mi
I once heard of a case where the employer sent an unrelated document to an
employee.

later on, the payment stopped and the employer claimed that they fired him at
that time.

I don't recall how that case ultimately turned out, but maybe something like
that? would be incredibly rare though and of dubious worth for mostly anyone

~~~
craftyguy
I believe there are courier services that help guarantee document delivery to
the correct person, i.e. for sending/serving court summons.

