Hacker News new | past | comments | ask | show | jobs | submit login
Compressing and enhancing hand-written notes (2016) (mzucker.github.io)
639 points by pablode on Mar 12, 2018 | hide | past | web | favorite | 70 comments

I have an observation about scanning documents that results in good quality and smaller files, but I can't satisfactorily explain why it works. Consider these two cases:

(1) Scan document at very high resolution as a JPG and then use a third-party program (like Photoshop or whatever) to re-encode the JPG at your preferred low resolution.

(2) Scan document at your preferred low resolution as a JPG straight away. Don't re-encode afterward.

Intuition says that the results of #1 vs #2 should be identical, or that #1 should be worse because you're doing two passes on source material. But I always get better results with case #1 (i.e., high-res scan and re-encoding afterward) regardless of the type or model of scanner, or whether the scanner does the JPG encoding on-board the device itself or through a Windows/Linux/Mac driver bundled with the scanner.

My theory is that scanner manufacturers are deliberately choosing the JPG encoding profile that gets them the fastest result. They want to brag about pages per minute which is an easily measured metric. Quality of JPG encoding and file size take effort to compare, but everyone understands pages per minute.

If anyone has contrary experience I'd like to hear it. I've been seeing this for years with different document scanners and flatbed scanners -- regardless of how I tweak the scanner's settings, I can always get good quality in a small file by re-encoding afterward.

In addition to some other points, the downscaling step in #1 may also smooth out some noise in the source image. Less noise yields more compressible data.

In my own scanning workflow, I scan at 600 or 1200dpi to png, deskew, downscale, and apply a black/white threshold. This is all done with imagemagick: mogrify -deskew 40% -scale 25% -threshold 50%

If I want a PDF after that I'll use img2pdf.

If you're scanning at a lower resolution, the scanner has fewer samples to work with when trying to make a visual representation of your document. If you scan at a higher resolution, the algorithm could at the very least average together nearby samples. It could also detect sharp lines vs fuzzy borders and decide whether the low-res version has a sharp transition or an averaged color between areas.

Low-resolution scans are faster than high-resolution scans for this exact reason: the sampling is worse. There might be the same number of samples per line captured by the linear CCD (and then downsampled), but the distance between scanlines varies by DPI.

> My theory is that scanner manufacturers are deliberately choosing the JPG encoding profile that gets them the fastest result.

This is more-or-less correct. The chips in the printers have a lot less power than your CPU, and the algorithms are a lot worse than those in Photoshop.

I'm surprised there aren't any high-quality JPEG encoder ASICs.

Would scanners benefit from just using some of the plentiful, cheap, excellent-quality H.264 encoder ASICs—and treating the output as HEIF?

My guess slightly different: manufacturers are deliberately choosing the JPG encoding profile that gets them good quality in worst cases. Which also happens to get fast encoding and bigger files. Their motivation is simple. One case of negative user experience outweigh a thousand positive ones and hurts their reputation hard.

While the JPEG encoder in Photoshop is likely a lot better than what's in your scanner, I am, similarly to others here, fairly convinced that the majority of the difference is in the sample rate by the scanner.

I did some vanilla JS that simulates scaling from high DPI vs sampling directly at a specific DPI and the result does resemble the result of a low resolution scan.


I am aware optics does not work exactly work like this, but it's a reasonable approximation.

Of course you get better results with 1.

Here is why: if you go with (2), then (1) is still done: by the crap firmware in your printer or its driver. It scans at high resolution and then downsamples in some way over which you have no control. It might not even be done with floating-point math.

1. is a somewhat like getting the raw image from a camera: a higher quality source for your own processing.

On the top image, I see that the back side of the page has clearly leaked through. In my experiences with scanning paper, I found a trick that essentially eliminates any visible backside content: Using a flatbed scanner, I would scan with the lid open, and the room darkened.

The worst thing to do is to scan with the lid closed, with a lid that has a white background. This would increase the reflection from the backside of the page.

You achieve the same effect with a black paper on top of the document you’re scanning, or in between the pages if it’s a book. As a bonus you can leave the light on :)

Indeed, https://en.wikipedia.org/wiki/Vantablack would be great here.

Somebody is scratching his head now.

Nice to see a bit of k-means clustering. I was worried that this might attempt to be "smart" by converting to symbols, replicating the "Xerox changes numbers in copied documents" bug, but it's pure pixel image processing.

Very clean results. In some ways it's a smarter version of the "posterize" feature.

Here is the link to the blogpost that describes the problem: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...?

I can get seemingly comparable results with a couple of simple operations in Gimp.

Here is a casual job on the first image:


The steps:

1. Duplicate the layer.

2. Gaussian-blur the top layer with big radius, 30+.

3. Put the top layer in "Divide" mode. Now the image is level.

4. Merge the layers together into one.

5. Use Color->Curves to clean away the writing bleeding through from the opposite side of the paper.

6. To approximate the blurred look of Matt Zucker's result, apply Gaussian blur with r=0.8.


The unblurred image before step 6 is here: https://i.imgur.com/RbWSUnD.png

Here is approximately the curve used in step 5: https://i.imgur.com/lvfqCNK.png

I suspect Matt worked at a higher resolution; i.e. the posted images are not the original resolution scans or snapshots.

You can get the higher resolution one here: https://github.com/mzucker/noteshrink/blob/master/examples/n...

BTW I'm curious how you'd fare on the graph one. I didn't like his results for it. https://github.com/mzucker/noteshrink/blob/master/examples/g...


Note how the grid is completely gone, the Sharpie strokes are fuller and the ghosting around the red ink is gone. (The word "Red" seems to have been written faintly, like with a non-working ball point pen, and then written over properly.)

The thing is, I took a completely different approach here. I won't give a complete step-by-step recipe, but the gist of it is this:

1. Create a copy layer of the image.

2. Optionally level the intensity with the divide trick; I didn't bother.

3. Convert this copy to grayscale.

4. Threshold it to black and white, such that the grid is eliminated, but the writing remains solid.

5. Blur the writing (radius 3-4).

6. Threshold again.

7. Now you have a black and white version that is a bit thicker than the original. TURN THIS INTO A LAYER MASK. An inverted one which passes through the writing, and renders everything else transparent.

8. Apply this mask to the original image. This requires transfering a layer mask between layers.

9. Slide a white background under the masked layer. Now you have the lettering clean on white.

10. Play with simple Color->Brightness-Contrast. I ended up with something like brightness -66, contrast +88.

In the final step, because of the layer mask that is in effect, these controls affect only the writing: the white coming from the unaffected layer below stays white no matter what you do with the contrast and brightness controls.

Why the different approach: I first tried the original approach and the result was good. But I thought you wouldn't like it either. It was similar to Matt's. I did a better job of eliminating the grid, but the writing was less vivid. (Likewise I also preserved the yellow tint of the paper.) I wanted the grid completely gone, with vivid writing. Playing with the intensity transfer curves was not quite doing it; there was poor separation between vanshing the grid while preserving the ink.

This attempt can be seen here: https://imgur.com/a/ldrBN The green writing is particularly unsatisfactory.

Using a blurred version of the whole image as background is probably better than what the OP is doing, as I understand it he's treating the background as a single fixed color.

In particular my use case is cleaning up pictures of whiteboards, where the brightness from the room is not as constant as a scan, and the approach wouldn't work at all.

Probably an easy PR for that repo though?

A simpler way of achieving the same thing is to duplicate the layer, blur the top layer heavily, and then set it to "divide".

interesting trick, thanks for sharing ! [edit] Quick test here : https://imgur.com/a/6xOz1

I think that test illustrates it'll take quite a lot more to achieve what the tool in the article did...

Really? Do you care to explain? What is the dividend and what is the divisor? Why can dividing a image by its low pass filtered version (or vice versa) be used to "clean up" the image, i.e. subtract the background, find main colors and cluster similar colors with k-means? What if the divisor has pixels near zero?

Areas of low contrast become whiter and areas of high contrast become more saturated.

It is also more robust than k-means. The author's algo will only work on scanned images. Photographed pages from a book will often have a slight shadow on half the page from the curvature. Blur-divide will clean this up. K-means will think you've used a lot of gray and not figure out that there are multiple background colors.

I can confirm that the author's approach doesn't work well for photographed pages. I took a photograph[0] of a page of notes, and due to the shadow, the results[1] were very unsatisfactory.

[0]: https://i.imgur.com/CLZHshT.jpg

[1]: https://i.imgur.com/rrwca0m.jpg

This reminds me of my time in university, when I saved all my lecture notes as DjVu [1] files.

It's a great file format for space-efficient archiving of scans like that, with a bit of scripted preprocessing.

[1]: https://en.wikipedia.org/wiki/DjVu

I like the idea, but DjVu seems to be very proprietary / single vendor and not in widespread use. This has made me reluctant to use it for archival purposes (vs say PDF, which has its own issues, but feels slightly more future proof to me).

I think PDF can cover pretty much the same ground with JBig and Jpeg2k. (And I believe archive.org is doing that.) But I don't know of any open source code to do the segmentation / encoding. (You have to split the bitmap from the background for jbig / jpeg encoding.)

There is an open-source DjVu library as well:


Whether that makes the format "widespread" enough for your use case is of course your decision to make.

The major source for DjVu files I have run across is the Internet Archive's book scanning. (Weirdly, I can't find any examples.) They're usually smaller than PDFs.

I wonder if your technique could remove some lines for the paper we use in France [1].

I never really understood why they were so many lines...

[1]: https://images-na.ssl-images-amazon.com/images/I/815WQQdAHBL...

Is this really standard writing paper? I assume it would be useful for calligraphy or learning how to write (as you can use the subdivision to draw letters to the correct height) but I find it weird for it to be standard issue paper.

It is. It's called "French-ruled paper" and is the school standard in France. [0]

[0] https://getfrenchbox.com/what-is-french-ruled-paper-seyes-sc...

My handwriting in school would have been better if I could have used this.

I hadn't seen it before, but it seems to be pretty common for French-ruled paper. https://getfrenchbox.com/what-is-french-ruled-paper-seyes-sc...

It helps children learning how to write. Lowercase letters start from the thicker line to the first thinner line above. Uppercase letters and taller lowercase letters like "t" or "d" go to the second line. And the tails of letters like "g" or "y" go the first line below.

I wonder if it'd be possible to do automatic detection and removal of notebook lines via an FFT (frequency domain) transform.

For anyone having issues getting this to work on macOS with homebrew dependencies, I was able to get it to work after finally getting an old version of numpy installed using the following command.

  sudo pip install --upgrade --ignore-installed --install-option '--install-data=/usr/local' numpy==1.9.0
If you don't use the numpy==1.9.0 you'll get the 1.14.2 version which is also broke.

The rest of the options allow pip to soft-override the macOS built-in numpy 1.8.0 which is immutable in the /System/ directory.

Anyway, after I did all that I was able to start playing with the app, I had previously been using a kludge workflow to get a nice output in black and white by using the imagemagick convert -shave option to remove the scanned edges of images, then doing a -depth 1 to force the depth down (which only works well on really clean scans), then I can -trim to clear the framing white pixels and re-center using the -gravity center -extent 5100x6600 to frame the contents centered inside a 600dpi image.

Rough but it works, I was hassling with trying to isolate "spot colors" for another thing, but this might actually do the trick!!!

This is awesome, and a depressingly large factor better than any blog post I’ll ever write.

I totally identify with the need for this. I also want to archive images of notes and whiteboards, and they must be kept small as so far my life fits in google drive and github.

Currently I use Evernote to do this. I don’t use any other functionality in Evernote but the “take photo” action does processing and size reduction very like the blog post.

Great job with that. I've only just started taking notes by hand once again, after being keyboard-only for many years.

In your scenario, since you have assigned "scribes" taking the notes, you might be able to streamline the process with a "smart pen."

There are several on the market. The one I got as a hand-me-down from a family member lets you write dozens of pages of notes, then Bluetooth them to a smartphone app that exports to PDF, Box, Google Drive, etc... Or it can actually copy the notes to the app in real time. Combined with a projector, this might be useful for the other students during class.

It's supposed to be able to OCR the notes, too, but I haven't bothered to figure out how. But there's a cool little envelope icon in the corner of each notebook page that if you put a checkmark on, it will automatically e-mail the page to a pre-designated address.

Again, there are several models on the market. Mine retails for about $100. Notebooks come in about 15 different sizes and cost about the same as a regular quality notebook.

Just some thoughts.

I have found that my Galaxy Note 2014 is pretty much hands-down the best note taking tablet in my opinion. It's better than the crap that Microsoft and Apple are trying to hawk off. It doesn't have as many fancy apps but for _strictly_ note taking, sharing notes via email, and book reading, it's pretty awesome.

I just wish its price would come down. It's still full price from four years ago :| and even getting more expensive because it's so old

You are referring to the 10.1 tablet? I know its not the same but I had a Note4 and was pleased with its note taking ability. I imagine that with the extra screen real-estate the tablet was even better. It looks like they are $453 on Newegg! I agree that does seem crazy for an old device. If you can put up with a used device there are two on swappa for around $200 https://swappa.com/buy/samsung-galaxy-note-101-2014-wifi If it truly is the hands-down best note taking tablet you might as well buy them both and standardize to the platform.

Yes, this is exactly!

However, I prefer the Wifi-only model. I neither need nor want cell service on my tablet. Not only that, but cellular providers end up installing complete garbage for their software.

I recently broke my first device. I bought a new one that was advertised Wifi-only. The received device was branded for Verizon and was literally and completely unusable without a Verizon SIM card. I ended up getting it replaced with a T-Mobile one which... while isn't not completely unusable, there's still tons of crapware from T-Mobile that I cannot uninstall.

It makes me really sad.

I'll have to check out your link at home. Thanks!

I have to say that I love the galaxy note 8.0 for note taking. So much so that I'm trying to buy up as many as I can find so that I will always have one. There are 2 main reasons for this:

1st: I love the size. The 10s are a little too big to cart around in a jacket pocket, and the note phones are too small to be a decent notebook. (I must admit though, I'm rethinking this point, as I used to be a huge moleskin fan)

2nd: The Samsung s-note app from that time is awesome! Honestly, every time samsung comes out with a new pen based tablet or phone, I always check it out. And I'm looking for 2 features: a: the ability to import a pdf file & b: goddamned smart shapes! I suck at drawing, and so I love that on my 8.0 I can draw a shitty square or circle and the software will make it look all ship shape. But they seemed to have removed that feature from newer versions of s-note, and I haven't found a 3rd party app that can do just those 2 things.

Any pointers to an app that will allow me to import pdfs and draw on them using smart shapes would be awesome!

Why is the Galaxy Note 2014 good for note taking? What app(s) do you use for note taking? What's bad about Microsoft's and Apple's offerings?

As far as Apple's offerings: no idea. The last Apple device I used was an 4th-generation iPod back when that was the new thing. Coming from a desktop-power-user background, iOS 4's UX was extremely offensive. So is Android's, by the way, but at least Android's horrid UX cheaper.

Technicalities: 1) the S Pen's sensitivity is much closer to the image. Microsoft Surface, in contrast, feels and appears about 1/4" above the image: that causes me to mis-write a lot since I feel the stylus at a different location from where the drawing is being applied. I don't know about you, but I don't write with my pen/pencil/stylus straight up 90 degrees from the surface being written upon which is the only way I could reliably use a Surface for writing.

2) the Galaxy Note 2014's resolution is higher than the Surfaces I've used even though it's on a smaller form factor. That means that the image simply looks sharper.

3) the S Pen uses a different tech so that the Galaxy Note is able to recognize the difference between my fingers (or my palm) resting on the tablet's surface vs the stylus writing whatever's being written.

4) Both of the Galaxy Note 2014s I've used have had a battery life of about a full week from a full charge. It does depend on how much you use it, of course, but typically it's about 4.5 days, dying right at the tail end of Friday. That's with writing notes for about 2 or 3 hours during every workday. Contrast with Surface 3 and Surface Pro ... one had lasted a day and a half on average while the other lasted about 10 hours.

5) S Note translates my handwriting with about a 90% accuracy. But OneNote translates my handwriting to literal garbage text for about 50% of the time. When it doesn't translate to garbage text, there's of course the common typos from thinking I->l->1 B->0->O->o->D->etc (which S Note also has trouble with, but not as much).

Opinions: I absolutely despite Microsoft's and Apple's OS offerings these days. I don't like Android, either, but it's the least hated of the three. I use Windows 7 at home for games but otherwise Linux the whole way. Linux allows me to actually customize the UI the way I want; when it doesn't, there's the good ole' command line.

I've found the S Note app to be fairly intuitive and straightforward. It's really easy to export notes to PDF or email, eg to share with coworkers. If you can get around OneNote's shitfest version of OCR, I'll admit that the rest of it pretty intuitive. I really didn't care to learn (yet another) different way of doing things though.

If I had _one_ complaint about the Galaxy Note, it's that I have yet to find a way to have it require Active Directory to unlock; so if that's a requirement then ~~your company sucks~~ you're SOL for using it at work. If someone knows of a viable way to login to an Active Directory account on Android, that'd be awesome.

I also have to say that I actively use Google Play Books. Good luck getting that to work on Surface or iOS outside of ~~Chrome~~ a browser; I never could anyway. If I have to log in to a site using a browser and use the browser to use the site instead of using native apps, then why not just use a Chromebook?

Which smart pen do you have?

I used to use a free software "ComicEnhancerPro" (The author is Chinese, there is English version but may not easy to find reliable download site) specially designed to enhance scanned comics.

You can remove the background very effectively by dragging a curve with preview.

You almost always need to preview and adjust some parameters, unless you have a template for similar cases.

In terms of compression for scanned notes, I haven't found anything that comes close to what even an older version of Adobe Acrobat yields, due to the use of the JBIG2 codec. Has anybody found any way to compress PDF files with JBIG2 on Linux/Mac? It's pretty much the only reason I have to find a Windows machine with Acrobat installed a couple of times a year, to postprocess a batch of scanned PDFs.

`German and Swiss regulators have subsequently (in 2015) disallowed the JBIG2 encoding in archival documents.[19]`

Yeah, but I'm not archiving for the German or Swiss government. For scanned handwritten notes, JBIG2 still beats out anything else by an order of magnitude at least

JBIG2 has been known to swap 6s and 8s. Use it at your own risk. There's a reason nothing uses it anymore.

Is there an alternative that provides comparable compression?

Looks interesting. Normally when I'm "cleaning" up scans I use unpaper, but although there is some overlap in functionality it doesn't do the same.

Anyway very nice writeup and I will add it to my arsenal and give it a closer look later. Could be useful for my document archive+ocr solution.

Edit: too bad seems like it didn't see any activity in the last year

> Edit: too bad seems like it didn't see any activity in the last year

That's not necessarily bad: sometimes a piece of software can be done, or nearly so.

Yup but a project like this would have an empty issue tracker this one not so much ;)

(which doesn't mean that it's a bad project or that I wont use it. it means that I will probably start to work on it)

There actually was some activity in three different branches in january: https://github.com/kskyten/noteshrink/network

Yeah on a branch of a fork.

I have made some progress on this as my home project using same compression and scan. I call it DFA - digital file analytics where data/images/scanned documents are sent remotely using Kafka to Hadoop and then run OCR to extract text and compression. If the document is more then 10MB go to HBase otherwise HDFS. Near real-time streaming using Spark and Flink is done too. Visualization using Banana dashboard is not so cool as it shows word counts, storage location, images and tags. Analytics on top of extracted data using ML would like to do next.

More you can find at https://medium.com/@mukeshkumar_46704/digital-files-ingestio...

Could you expand on your archive+ocr? I long wanted to start doing something like this, but never got to. I guess reading others' experience can be useful.

How does this method compare to adaptive thresholding or Otsu's binarization method?


A lot of apps out there which can do this for you: http://uk.pcmag.com/cloud-services/86200/guide/the-best-mobi...

The Image Capture app on macOS is surprisingly good at this, and one of the things I really miss on Linux and Windows, so this is neat to have.

It’s also interesting for me to think about how this is a generalization of converting a scan to black and white for clarity :)

I wonder if it might improve by using a better color space than HSV. Maybe CieLAB?

Is there much room for improvement? Looks pretty good to me.

It seems to me that the inaccuracies/inefficiencies/errors/whatever you like in using RGB are basically truncated out of existence by the very, very harsh binning that is occurring. I wouldn't expect any visible differences to emerge from any alternate color space.

The link says that generated pdf is a container for the png or jpg image. Is it possible to get a true pdf from the scan? Specifically so that i can search inside the pdf.

Also, see ScanTailor by Joseph Artsimovich. Excellent tool.

It would be interesting to see this paired with potrace.

interesting, i was working on something similar, (get color palette from an image).

This is GEM!

Can be incrementally improved by using a more human-focused color model than HSV, like CIECAM02 or CIELAB.

Applications are open for YC Winter 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact