
Compressing and enhancing hand-written notes (2016) - pablode
https://mzucker.github.io/2016/09/20/noteshrink.html
======
alister
I have an observation about scanning documents that results in good quality
and smaller files, but I can't satisfactorily explain why it works. Consider
these two cases:

(1) Scan document at very high resolution as a JPG and then use a third-party
program (like Photoshop or whatever) to re-encode the JPG at your preferred
low resolution.

(2) Scan document at your preferred low resolution as a JPG straight away.
Don't re-encode afterward.

Intuition says that the results of #1 vs #2 should be identical, or that #1
should be worse because you're doing two passes on source material. But I
always get better results with case #1 (i.e., high-res scan and re-encoding
afterward) regardless of the type or model of scanner, or whether the scanner
does the JPG encoding on-board the device itself or through a
Windows/Linux/Mac driver bundled with the scanner.

My theory is that scanner manufacturers are deliberately choosing the JPG
encoding profile that gets them the fastest result. They want to brag about
pages per minute which is an easily measured metric. Quality of JPG encoding
and file size take effort to compare, but everyone understands pages per
minute.

If anyone has contrary experience I'd like to hear it. I've been seeing this
for years with different document scanners and flatbed scanners -- regardless
of how I tweak the scanner's settings, I can always get good quality in a
small file by re-encoding afterward.

~~~
sp332
If you're scanning at a lower resolution, the scanner has fewer samples to
work with when trying to make a visual representation of your document. If you
scan at a higher resolution, the algorithm could at the very least average
together nearby samples. It could also detect sharp lines vs fuzzy borders and
decide whether the low-res version has a sharp transition or an averaged color
between areas.
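
This point can be made concrete with a toy example (my own numpy sketch, not
anything from an actual scanner pipeline): averaging blocks of high-res
samples keeps a thin line's contribution, while naive subsampling can drop it
entirely.

```python
import numpy as np

def downsample_average(img, factor):
    """Average each factor-by-factor block -- what a good rescaler does."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def downsample_nearest(img, factor):
    """Keep every factor-th sample -- what a cheap scan path may do."""
    return img[::factor, ::factor]

# A white page with one thin dark line that falls between the kept samples:
page = np.ones((8, 8))
page[1, :] = 0.0
print(downsample_nearest(page, 2).min())  # 1.0 -- the line vanished entirely
print(downsample_average(page, 2).min())  # 0.5 -- the line survives as gray
```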

~~~
Scaevolus
Low-resolution scans are _faster_ than high-resolution scans for this exact
reason: the sampling is worse. There might be the same number of samples per
line captured by the linear CCD (and then downsampled), but the distance
between scanlines varies by DPI.

------
nayuki
On the top image, I see that the back side of the page has clearly leaked
through. In my experience scanning paper, I found a trick that essentially
eliminates any visible backside content: using a flatbed scanner, I scan with
the lid open and the room darkened.

The worst thing to do is to scan with the lid closed, with a lid that has a
white background. This would increase the reflection from the backside of the
page.

~~~
MagerValp
You can achieve the same effect with a sheet of black paper on top of the
document you’re scanning, or in between the pages if it’s a book. As a bonus,
you can leave the light on :)

~~~
nayuki
Indeed,
[https://en.wikipedia.org/wiki/Vantablack](https://en.wikipedia.org/wiki/Vantablack)
would be great here.

------
pjc50
Nice to see a bit of k-means clustering. I was worried that this might attempt
to be "smart" by converting to symbols, replicating the "Xerox changes numbers
in copied documents" bug, but it's pure pixel image processing.

Very clean results. In some ways it's a smarter version of the "posterize"
feature.
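
The k-means palette step boils down to plain Lloyd's algorithm on pixel
colors; a minimal numpy sketch (my own, not the article's code):

```python
import numpy as np

def kmeans_palette(pixels, k=4, iters=20, seed=0):
    """Cluster an (N, 3) array of RGB pixels into k palette colors."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each pixel to its nearest center...
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ...then move each center to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return centers, labels
```

Posterize quantizes each channel independently; k-means instead picks the k
colors that best fit this particular image, which is why the results look
cleaner.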

~~~
wmu
Here is the link to the blog post that describes the problem:
[http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...](http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning)

------
kazinator
I can get seemingly comparable results with a couple of simple operations in
Gimp.

Here is a casual job on the first image:

[https://i.imgur.com/Sy2rvsU.png](https://i.imgur.com/Sy2rvsU.png)

The steps:

1\. Duplicate the layer.

2\. Gaussian-blur the top layer with big radius, 30+.

3\. Put the top layer in "Divide" mode. Now the image is level.

4\. Merge the layers together into one.

5\. Use Color->Curves to clean away the writing bleeding through from the
opposite side of the paper.

6\. To approximate the blurred look of Matt Zucker's result, apply Gaussian
blur with r=0.8.

Notes:

The unblurred image before step 6 is here:
[https://i.imgur.com/RbWSUnD.png](https://i.imgur.com/RbWSUnD.png)

Here is approximately the curve used in step 5:
[https://i.imgur.com/lvfqCNK.png](https://i.imgur.com/lvfqCNK.png)

I suspect Matt worked at a higher resolution; i.e. the posted images are not
the original resolution scans or snapshots.
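
For anyone without GIMP handy, steps 1-4 amount to dividing the image by a
heavily blurred copy of itself. Here is a rough numpy-only sketch (my own; it
substitutes a separable box blur for GIMP's Gaussian and clamps the divisor to
avoid division by zero):

```python
import numpy as np

def box_blur(img, radius):
    """Crude separable box blur standing in for a large-radius Gaussian."""
    k = np.ones(2 * radius + 1) / (2 * radius + 1)
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def divide_flatten(img, radius=30):
    """GIMP's "Divide" mode: original / blurred background estimate.

    Slowly varying shading divides out to ~1.0 (white); ink, which the big
    blur smears away in the divisor, stays dark.
    """
    background = np.clip(box_blur(img, radius), 1e-3, None)  # no divide-by-zero
    return np.clip(img / background, 0.0, 1.0)
```

This works on a float image in [0, 1]; for a color scan, apply it channel by
channel.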

~~~
fouc
You can get the higher resolution one here:
[https://github.com/mzucker/noteshrink/blob/master/examples/n...](https://github.com/mzucker/noteshrink/blob/master/examples/notesA1.jpg)

BTW I'm curious how you'd fare on the graph one. I didn't like his results for
it.
[https://github.com/mzucker/noteshrink/blob/master/examples/g...](https://github.com/mzucker/noteshrink/blob/master/examples/graph-paper-ink-only.jpg)

~~~
kazinator
[https://imgur.com/a/dsLhk](https://imgur.com/a/dsLhk)

Note how the grid is completely gone, the Sharpie strokes are fuller and the
ghosting around the red ink is gone. (The word "Red" seems to have been
written faintly, like with a non-working ball point pen, and then written over
properly.)

The thing is, I took a completely different approach here. I won't give a
complete step-by-step recipe, but the gist of it is this:

1\. Create a copy layer of the image.

2\. Optionally level the intensity with the divide trick; I didn't bother.

3\. Convert this copy to grayscale.

4\. Threshold it to black and white, such that the grid is eliminated, but the
writing remains solid.

5\. Blur the writing (radius 3-4).

6\. Threshold again.

7\. Now you have a black and white version that is a bit thicker than the
original. _TURN THIS INTO A LAYER MASK_. An inverted one which passes through
the writing, and renders everything else transparent.

8\. Apply this mask to the original image. This requires transferring a layer
mask between layers.

9\. Slide a white background under the masked layer. Now you have the
lettering clean on white.

10\. Play with simple Color->Brightness-Contrast. I ended up with something
like brightness -66, contrast +88.

In the final step, because of the layer mask that is in effect, these controls
affect only the writing: the white coming from the unaffected layer below
stays white no matter what you do with the contrast and brightness controls.
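
The core of the recipe can be sketched in numpy too (my own approximation, not
the exact GIMP operations; the blur-and-rethreshold of steps 5-6 is replaced
by the equivalent morphological dilation):

```python
import numpy as np

def ink_mask(gray, ink_thresh=0.45, grow_radius=2):
    """Steps 4-6: threshold so the faint grid drops out, then grow the
    mask so it also covers the soft edges of each stroke."""
    mask = (gray < ink_thresh).astype(float)   # ink is darker than the grid
    g = grow_radius
    padded = np.pad(mask, g, mode="constant")
    grown = np.zeros_like(mask)
    for dy in range(2 * g + 1):                # max filter == dilation
        for dx in range(2 * g + 1):
            window = padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
            grown = np.maximum(grown, window)
    return grown.astype(bool)

def mask_to_white(img, mask):
    """Steps 8-9: keep ink pixels, slide white under everything else."""
    return np.where(mask, img, 1.0)
```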

 _Why the different approach:_ I first tried the original approach and the
result was good, but I thought you wouldn't like it either, since it was
similar to Matt's. I did a better job of eliminating the grid, but the writing
was less vivid. (Likewise, I also preserved the yellow tint of the paper.) I
wanted the grid completely gone, with vivid writing. Playing with the
intensity transfer curves was not quite doing it; there was poor separation
between removing the grid and preserving the ink.

This attempt can be seen here:
[https://imgur.com/a/ldrBN](https://imgur.com/a/ldrBN) The green writing is
particularly unsatisfactory.

------
keenerd
A simpler way of achieving the same thing is to duplicate the layer, blur the
top layer heavily, and then set it to "divide".

~~~
donquichotte
Really? Do you care to explain? What is the dividend and what is the divisor?
Why can dividing an image by its low-pass-filtered version (or vice versa) be
used to "clean up" the image, i.e. subtract the background, find main colors
and cluster similar colors with k-means? What happens if the divisor has
pixels near zero?

~~~
keenerd
Areas of low contrast become whiter and areas of high contrast become more
saturated.

It is also more robust than k-means. The author's algo will only work on
scanned images. Photographed pages from a book will often have a slight shadow
on half the page from the curvature. Blur-divide will clean this up. K-means
will think you've used a lot of gray and not figure out that there are
multiple background colors.

~~~
333c
I can confirm that the author's approach doesn't work well for photographed
pages. I took a photograph[0] of a page of notes, and due to the shadow, the
results[1] were very unsatisfactory.

[0]: [https://i.imgur.com/CLZHshT.jpg](https://i.imgur.com/CLZHshT.jpg)

[1]: [https://i.imgur.com/rrwca0m.jpg](https://i.imgur.com/rrwca0m.jpg)

------
trurl42
This reminds me of my time in university, when I saved all my lecture notes as
DjVu [1] files.

It's a great file format for space-efficient archiving of scans like that,
with a bit of scripted preprocessing.

[1]: [https://en.wikipedia.org/wiki/DjVu](https://en.wikipedia.org/wiki/DjVu)

~~~
dunham
I like the idea, but DjVu seems to be very proprietary / single vendor and not
in widespread use. This has made me reluctant to use it for archival purposes
(vs say PDF, which has its own issues, but feels slightly more future proof to
me).

I think PDF can cover pretty much the same ground with JBIG2 and JPEG 2000.
(And I believe archive.org is doing that.) But I don't know of any open-source
code to do the segmentation / encoding. (You have to split the bitmap from the
background for the JBIG2 / JPEG passes.)

~~~
pwg
There is an open-source DjVu library as well:

[http://djvu.sourceforge.net/](http://djvu.sourceforge.net/)

Whether that makes the format "widespread" enough for your use case is of
course your decision to make.

------
Softcadbury
I wonder if your technique could remove some lines for the paper we use in
France [1].

I never really understood why there are so many lines...

[1]: [https://images-na.ssl-images-amazon.com/images/I/815WQQdAHBL...](https://images-na.ssl-images-amazon.com/images/I/815WQQdAHBL._SL1500_.jpg)

~~~
John_KZ
Is this really standard writing paper? I assume it would be useful for
calligraphy or learning how to write (as you can use the subdivision to draw
letters to the correct height) but I find it weird for it to be standard issue
paper.

~~~
pfalke
It is. It's called "French-ruled paper" and is the school standard in France.
[0]

[0] [https://getfrenchbox.com/what-is-french-ruled-paper-seyes-sc...](https://getfrenchbox.com/what-is-french-ruled-paper-seyes-school/)

~~~
bb88
My handwriting in school would have been better if I could have used this.

------
haikuginger
I wonder if it'd be possible to do automatic detection and removal of notebook
lines with an FFT (i.e., filtering in the frequency domain).
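
It should be, for regularly spaced ruling. A hypothetical numpy sketch (my
own, not anything from the article): lines that are constant along each row
put their energy in a single column of the 2D spectrum, which can simply be
zeroed:

```python
import numpy as np

def remove_horizontal_ruling(img, keep_dc=2):
    """Zero the vertical-frequency column of the 2D FFT (except the lowest
    few coefficients, which carry overall brightness)."""
    F = np.fft.fft2(img)
    h = img.shape[0]
    F[keep_dc:h - keep_dc, 0] = 0  # ruling energy lives in this column
    return np.real(np.fft.ifft2(F))
```

Real notebook photos need more care: even a slight rotation smears the peaks
off the axis, so a practical version would detect and notch the actual peaks
rather than wipe the whole column.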

------
eltoozero
For anyone having issues getting this to work on macOS with homebrew
dependencies, I was able to get it to work after finally getting an old
version of numpy installed using the following command.

    sudo pip install --upgrade --ignore-installed --install-option '--install-data=/usr/local' numpy==1.9.0

If you don't pin numpy==1.9.0 you'll get the 1.14.2 version, which is also
broken.

The rest of the options allow pip to soft-override the macOS built-in numpy
1.8.0 which is immutable in the /System/ directory.

Anyway, after I did all that I was able to start playing with the app. I had
previously been using a kludge workflow to get nice black-and-white output:
use ImageMagick's convert with the -shave option to remove the scanned edges
of images, then -depth 1 to force the bit depth down (which only works well on
really clean scans), then -trim to clear the framing white pixels, and finally
-gravity center -extent 5100x6600 to re-center the contents inside a 600dpi
image.

Rough, but it works. I was hassling with trying to isolate "spot colors" for
another thing, but this might actually do the trick!!!

------
Myrmornis
This is awesome, and a depressingly large factor better than any blog post
I’ll ever write.

I totally identify with the need for this. I also want to archive images of
notes and whiteboards, and they must be kept small since, so far, my life fits
in Google Drive and GitHub.

Currently I use Evernote to do this. I don’t use any other functionality in
Evernote, but the “take photo” action does processing and size reduction very
much like the blog post's.

------
reaperducer
Great job with that. I've only just started taking notes by hand once again,
after being keyboard-only for many years.

In your scenario, since you have assigned "scribes" taking the notes, you
might be able to streamline the process with a "smart pen."

There are several on the market. The one I got as a hand-me-down from a family
member lets you write dozens of pages of notes, then Bluetooth them to a
smartphone app that exports to PDF, Box, Google Drive, etc... Or it can
actually copy the notes to the app in real time. Combined with a projector,
this might be useful for the other students during class.

It's supposed to be able to OCR the notes, too, but I haven't bothered to
figure out how. But there's a cool little envelope icon in the corner of each
notebook page that if you put a checkmark on, it will automatically e-mail the
page to a pre-designated address.

Again, there are several models on the market. Mine retails for about $100.
Notebooks come in about 15 different sizes and cost about the same as a
regular quality notebook.

Just some thoughts.

~~~
inetknght
I have found that my Galaxy Note 2014 is pretty much hands-down the best note
taking tablet in my opinion. It's better than the crap that Microsoft and
Apple are trying to hawk off. It doesn't have as many fancy apps but for
_strictly_ note taking, sharing notes via email, and book reading, it's pretty
awesome.

I just wish its price would come down. It's still at full price from four
years ago :| and is even getting more expensive because it's so old.

~~~
rripken
You are referring to the 10.1 tablet? I know it's not the same, but I had a
Note 4 and was pleased with its note-taking ability. I imagine that with the
extra screen real estate the tablet was even better. It looks like they are
$453 on Newegg! I agree that does seem crazy for an old device. If you can put
up with a used device, there are two on Swappa for around $200:
[https://swappa.com/buy/samsung-galaxy-note-101-2014-wifi](https://swappa.com/buy/samsung-galaxy-note-101-2014-wifi)
If it truly is the hands-down best note-taking tablet, you might as well buy
them both and standardize on the platform.

~~~
inetknght
Yes, that's exactly it!

However, I prefer the Wifi-only model. I neither need nor want cell service on
my tablet. Not only that, but cellular providers end up installing complete
garbage for their software.

I recently broke my first device. I bought a new one that was advertised as
Wifi-only. The received device was branded for Verizon and was literally and
completely unusable without a Verizon SIM card. I ended up getting it replaced
with a T-Mobile one which, while not completely unusable, still has tons of
crapware from T-Mobile that I cannot uninstall.

It makes me really sad.

I'll have to check out your link at home. Thanks!

------
dracodoc
I used to use a free program called "ComicEnhancerPro" (the author is Chinese;
there is an English version, but it may not be easy to find a reliable
download site), specially designed to enhance scanned comics.

You can remove the background very effectively by dragging a curve with
preview.

You almost always need to preview and adjust some parameters, unless you have
a template for similar cases.

------
goerz
In terms of compression for scanned notes, I haven't found anything that comes
close to what even an older version of Adobe Acrobat yields, due to the use of
the JBIG2 codec. Has anybody found any way to compress PDF files with JBIG2 on
Linux/Mac? It's pretty much the only reason I have to find a Windows machine
with Acrobat installed a couple of times a year, to postprocess a batch of
scanned PDFs.

~~~
ramses0
`German and Swiss regulators have subsequently (in 2015) disallowed the JBIG2
encoding in archival documents.[19]`

~~~
goerz
Yeah, but I'm not archiving for the German or Swiss government. For scanned
handwritten notes, JBIG2 still beats anything else by at least an order of
magnitude.

~~~
striking
JBIG2 has been known to swap 6s and 8s. Use it at your own risk. There's a
reason nothing uses it anymore.

~~~
goerz
Is there an alternative that provides comparable compression?

------
BlackLotus89
Looks interesting. Normally when I'm "cleaning" up scans I use unpaper;
although there is some overlap in functionality, it doesn't do the same thing.

Anyway, very nice writeup. I will add it to my arsenal and give it a closer
look later. Could be useful for my document archive + OCR solution.

Edit: too bad seems like it didn't see any activity in the last year

~~~
eadmund
> Edit: too bad seems like it didn't see any activity in the last year

That's not necessarily bad: sometimes a piece of software can be done, or
nearly so.

~~~
BlackLotus89
Yup, but a project like that would have an empty issue tracker; this one, not
so much ;)

(Which doesn't mean that it's a bad project or that I won't use it. It means
that I will probably start to work on it.)

~~~
foleac
There actually was some activity in three different branches in January:
[https://github.com/kskyten/noteshrink/network](https://github.com/kskyten/noteshrink/network)

~~~
BlackLotus89
Yeah on a branch of a fork.

------
andimai
How does this method compare to adaptive thresholding or Otsu's binarization
method?

[https://docs.opencv.org/3.4.0/d7/d4d/tutorial_py_thresholdin...](https://docs.opencv.org/3.4.0/d7/d4d/tutorial_py_thresholding.html)
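
For reference, Otsu picks one global threshold by maximizing the between-class
variance of the resulting two pixel populations; a minimal numpy version (my
own sketch, not the article's or OpenCV's code):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t in [0, 255] maximizing between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    w0 = 0          # pixel count at or below threshold
    sum0 = 0.0      # intensity sum at or below threshold
    best_t, best_var = 0, 0.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0 or w0 == total:
            continue
        w1 = total - w0
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

The key difference from the article: Otsu and adaptive thresholding produce a
binary image, whereas noteshrink keeps a small color palette, so highlighter
and ink colors survive.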

------
amai
A lot of apps out there can do this for you:
[http://uk.pcmag.com/cloud-services/86200/guide/the-best-mobi...](http://uk.pcmag.com/cloud-services/86200/guide/the-best-mobile-scanning-apps-of-2018)

------
jonathanyc
The Image Capture app on macOS is surprisingly good at this, and one of the
things I really miss on Linux and Windows, so this is neat to have.

It’s also interesting for me to think about how this is a generalization of
converting a scan to black and white for clarity :)

------
anc84
I wonder if it might improve by using a better color space than HSV. Maybe
CieLAB?

~~~
jerf
Is there much room for improvement? Looks pretty good to me.

It seems to me that the inaccuracies/inefficiencies/errors/whatever you like
in using RGB are basically truncated out of existence by the very, very harsh
binning that is occurring. I wouldn't expect any visible differences to emerge
from any alternate color space.

------
krsree
The link says that the generated PDF is a container for the PNG or JPG image.
Is it possible to get a true PDF from the scan? Specifically, so that I can
search inside the PDF.

------
davidzweig
Also, see ScanTailor by Joseph Artsimovich. Excellent tool.

------
herpaderp_33
It would be interesting to see this paired with potrace.

------
freecodyx
Interesting. I was working on something similar (getting a color palette from
an image).

------
neelkadia
This is a gem!

------
dontyouremember
Can be incrementally improved by using a more human-focused color model than
HSV, like CIECAM02 or CIELAB.

