Here is my very quick and dirty manual job of the same example page:
Literally less than five minutes.
First I cropped the image. Then I duplicated the layer, blurred the top layer (Gaussian, 50 radius), flipped it to Divide mode, and merged the visible layers. This leveled the lightness quite well, almost completely eliminating the shadow over the right side of the page and all other lighting differences. There is a hint of the edge of the shadow still present because it is such a sharp contrast, but that can be eliminated with an adjustment of the intensity curves. In such cases it may help to experiment with smaller blur radii, too.
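For anyone who wants to script that divide-by-blur step outside GIMP, here's a rough numpy sketch of the same idea. A box blur stands in for the Gaussian, and the function names and values are my own invention, not anything from GIMP:

```python
import numpy as np

def box_blur(img, radius):
    """Separable box blur (a stand-in for GIMP's Gaussian blur)."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(img, radius, mode="edge")  # edge padding avoids dark borders
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, out)

def level_lighting(gray, radius=50):
    """Divide-by-blurred-layer: the blur estimates the illumination;
    dividing it out maps evenly lit paper to ~1.0 (white)."""
    illum = box_blur(gray, radius)
    return np.clip(gray / np.maximum(illum, 1e-6), 0.0, 1.0)
```

With a shaded but otherwise blank page, the division cancels the lighting gradient almost exactly, which is why the result thresholds so cleanly afterwards.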
I then did a perspective transform in the lateral direction, squeezing the left side top-bottom and expanding the right, resulting in the warp now being approximately horizontal. (The perspective transform is not just for adding a perspective effect; it is also useful for reversing perspective!)
Finally, I used the Curve Bend tool (with its horrible interactive interface and awful preview) to warp the page in a compensating way. Basically, the idea is to draw upper and lower curves that are the opposite of the curve on the page. I made two attempts, keeping the results of the second.
If the preview of this tool wasn't a ridiculous, inscrutable thumbnail, it would be possible to do an excellent job in one attempt, probably close to perfect.
Because the page is evenly light thanks to the divide-by-blurred layer trick, it will nicely threshold to black and white, or a narrow grayscale range.
THIS is exactly the sort of insight I wish I would get when I click one of those "one weird trick" links!
They're a bit obscure if you don't know how they work. Basically, Grain Extract subtracts one layer from the other and adds 128, so you get a sort of fake 8-bit signed integer. Grain Merge does the opposite (adds the two layers, then subtracts 128).
I haven't tried divide-by-blur (but I'm going to :) ). Grain Extract, on the other hand, lets you subtract-with-blur, which is closer to what I'd do if I were to code such an algorithm myself (the operation is roughly what the Unsharp Mask filter does, but you get a bit more control). Still, I'm curious to see how divide-by-blur differs in its results.
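The two modes are easy to state in code. A quick numpy sketch of the 8-bit formulas as described above (base minus layer plus 128, and the inverse):

```python
import numpy as np

def grain_extract(base, layer):
    # GIMP Grain Extract: base - layer + 128, clipped to 0..255
    return np.clip(base.astype(int) - layer.astype(int) + 128, 0, 255).astype(np.uint8)

def grain_merge(base, layer):
    # GIMP Grain Merge: base + layer - 128 (the inverse, up to clipping)
    return np.clip(base.astype(int) + layer.astype(int) - 128, 0, 255).astype(np.uint8)
```

So a blurred copy in Grain Extract mode is exactly the subtract-with-blur mentioned above, centred on mid-grey; the two modes round-trip as long as nothing clips.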
Check out this Flickr post, which has a nice explanation of using Grain Extract (for a different purpose there; it's really a very interesting and versatile trick, and I only discovered it last week myself):
Subtraction would be fine if the data were logarithmic: that is to say, if the intensity information we are operating on represented decibels. We figure out the low-frequency signal's "floor" over each area of the image, and simply subtract those decibels.
Division is better on the assumption that the intensity values of the pixels are linear. Division of linear samples is subtraction, in the decibel scale.
The right way is somewhere in between, because in RGB displays, pixels are put through some gamma curve. They are neither linear nor logarithmic.
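To make that concrete, a tiny sketch. The sample values and the simple 2.2 gamma are just illustrative:

```python
import math

# Removing an illumination estimate I from a sample S:
#   linear domain:   S / I
#   decibel domain:  dB(S) - dB(I)
def to_db(x):
    return 10 * math.log10(x)

s, i = 0.32, 0.4                 # e.g. paper reflectance 0.8 under 0.4 illumination
linear = s / i                   # division of linear samples...
log_diff = to_db(s) - to_db(i)   # ...is subtraction in the decibel scale
assert abs(10 ** (log_diff / 10) - linear) < 1e-12

# Display pixels sit in between: a gamma curve, neither linear nor logarithmic.
encoded = linear ** (1 / 2.2)    # assuming a plain 2.2 gamma for illustration
```

Dividing gamma-encoded pixels is therefore only an approximation of dividing the true linear intensities, which is the "somewhere in between" point above.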
By the way, here is a useful trick: a blurred layer in Divide mode with varied opacity. You can reduce the amount of contrast between light and dark areas in an image without completely leveling it; this works well at low opacity (just a slight blend of the Divide-mode layer). If you have a picture with areas that are too dark and too bright, a touch of this can help. Large radii tend to work well, with blends of up to 20% or so.
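In code, partial opacity is just a linear blend between the image and the fully leveled version. A sketch, assuming float images in 0..1 and a precomputed large-radius blur:

```python
import numpy as np

def divide_blend(img, blurred, opacity=0.2):
    """Blurred layer in Divide mode at partial opacity: blend the image
    with its divide-by-blur result instead of replacing it outright."""
    divided = np.clip(img / np.maximum(blurred, 1e-6), 0.0, 1.0)
    return (1 - opacity) * img + opacity * divided
```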
For me, this does what I want Retinex to do! Only better, with intuitive control parameters and predictable results.
Look at the picture of the little girl peering out of the back window of a truck on NASA's Retinex page, as enhanced by Retinex:
Now my version:
The reflection in the glass is brought out in a less cheesy way; you might not guess it is the result of processing if you didn't already know.
I did a decompose of the image to the LAB color space (Colors/Components/Decompose...). Then I used a blur radius of 200 pixels on a copy of the L layer, put it into Divide mode, blended at a bit over 30%, and recomposed from LAB back to the original RGB image.
(That's a simplification; of course I had to struggle with Gimp to preserve the ID of the L layer, which is destroyed by a straight "merge layer down" after which the recompose operation fails with an error. I in fact made two copies of the L layer, and did the processing and merge operation between those two copies. Then I did a select all to copy the resulting layer to the clipboard and pasted that into the original L layer to replace it.)
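Outside GIMP, the whole LAB round trip can be approximated in a few lines. This sketch of mine uses plain Rec. 601 luma as a stand-in for a true LAB L channel (so colors are only approximately preserved), and takes the blurred luma precomputed:

```python
import numpy as np

LUMA = np.array([0.299, 0.587, 0.114])  # Rec. 601 weights, standing in for L*

def flatten_lightness(rgb, blurred_luma, opacity=0.3):
    """Flatten the lightness with divide-by-blur at partial opacity,
    then scale RGB by the new/old luma ratio to keep the chroma."""
    y = rgb @ LUMA
    divided = np.clip(y / np.maximum(blurred_luma, 1e-6), 0.0, 1.0)
    new_y = (1 - opacity) * y + opacity * divided
    scale = new_y / np.maximum(y, 1e-6)
    return np.clip(rgb * scale[..., None], 0.0, 1.0)
```

Scaling all three channels by the same ratio keeps hue ratios intact, which is roughly what operating only on the L layer buys you.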
I wish he went into more detail on the steps taken after dewarping. You can tweak the image levels to get good contrast, but surprisingly there aren't any shadows from underleveling or loss of detail from overleveling. I wonder if the author ran OCR on the scans afterwards. Speaking of OCR, IIRC Leptonica is one of the dependencies of Tesseract, so it must do some similar pre-processing.
Edit: reading more carefully, he mentions that he used adaptive thresholding from OpenCV.
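Mean-based adaptive thresholding is simple enough to sketch directly. This is my numpy approximation of the idea behind OpenCV's adaptiveThreshold with a mean kernel; the window radius and offset c here are arbitrary choices, not the author's:

```python
import numpy as np

def adaptive_threshold(gray, radius=7, c=0.1):
    """Mark a pixel as ink when it is darker than its local-window
    mean by more than c, so gradual shading never trips it."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(gray, radius, mode="edge")
    local = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    local = np.apply_along_axis(lambda col: np.convolve(col, kernel, mode="valid"), 0, local)
    return gray < local - c
```

Because the threshold tracks the local mean, a smooth lighting gradient cancels out and only genuinely dark marks survive, which is why it pairs so well with (or can even replace) the divide-by-blur leveling.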
This code had the same idea, and is open-source!
(this isn't to slight the article - it's a great, well written presentation on how to implement it)
The reason is that CamScanner tried to upsell me to some monthly plan.
For all the good advice on HN that you should build recurring revenue: it seriously annoys me when people try to do that by demanding monthly payments for static features.
(Totally OK with selling license keys for new features etc.)
Gotta admire this guy's resourcefulness—and patience. If I were a professor, I'd probably just reject the assignment outright if a student sent me a bunch of photos from their smartphone in lieu of a PDF or a "proper" scan. :)
If you decide to try to make this faster, check out Ceres, a non-linear least-squares optimisation framework that does automatic differentiation using a clever C++ template hack.
I've used it a few times to solve these kinds of problems and found it to be very good!
By the way, you mentioned using word boundaries in regexes to replace variable names. GNU Emacs regexes can actually include "symbol boundaries" (which are a little better for variable names than word boundaries), represented as "\_<symbol\_>". Personally, I like using the "highlight-symbol" package, which provides the "highlight-symbol-query-replace" command to basically execute M-% for the symbol at point.
Most phone cameras these days have good resolution, and you could technically take a 6x4 photo, divvy it into a 3x3 grid, take close-up photos, and have smart algorithms interpolate the pixels to form a single high-res image. I'd even bet you'd get results equal to or better than a flatbed scanner's.
Better still, just open the camera preview and slowly pan over the image.
Has anyone tried something like this? With FOSS apps like mosaic, HDR tools, and ImageMagick, it should be possible. I'm guessing OpenCV would be needed for interpolation and noise removal.
The next hard problem would be helping with DIY book scanning: the camera could sit over my shoulder and detect when I've turned a page, then automatically take a picture of the new page. Then OCR kicks in, and conversion to EPUB, preserving graphics when pages have non-text elements. Mostly I just feel like we can probably do better than the massive contraptions over at www.diybookscanner.org.
It likely didn't help that many of the pages were typed carbon copies or hand-written. For the former I had to put another piece of paper behind the page to prevent the next page from bleeding through.
That said, I managed to get the job done, though it took a couple of weeks. Next is to capture the metadata (author, date written, ...). There were a lot of one page letters in the documents I copied.
Would love to try out another tool that could read from the ScanTailor project file to get the page segmentation or even the warping I did manually to improve on the result.
If you have the option go for scanning.
Scans still benefit from cleanup, such as rotation and adjusting the levels (making everything near white perfectly white) and possibly manual cleanup (removal of specks of dust or the odd hair that was on the scanner).
The pages still warp, but less than when lying flat on the table, and it's quicker if you do all the even pages, then all the odd pages. Actually, the hardest part is keeping the camera stable so it doesn't move when you press the button.
Lots of info here on different hardware setups, and on scanning software that can batch-convert into PDF and other formats.
There are many ways of doing this, and you can achieve some results even without knowing whether your image is text, as long as it has lots of self-similarity. Virtually slide a "grid" over the image, slicing it up into n-by-n squares; run any of a number of nearest-neighbour variants over them; then, for each cluster, replace all instances of the squares in that cluster with the one that minimises the overall error rate versus the others.
This will work reasonably well for very structured images such as text, as long as enough characters are near-correct, and it will retain custom fonts etc. while cleaning them up quite a bit, provided they are either different enough, or occur often enough on a page, not to get "corrected".
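A greedy toy version of that tile-clustering idea fits in a few lines. This numpy sketch is a simplification of mine: a fixed MSE tolerance and first-match assignment stand in for a real nearest-neighbour search:

```python
import numpy as np

def dedupe_tiles(img, n=4, tol=1.0):
    """Slice into n-by-n tiles and greedily cluster them: a tile joins
    the first representative within tol (mean squared error), and every
    member of a cluster is replaced by that representative."""
    h, w = img.shape
    out = img.copy()
    reps = []
    for y in range(0, h - h % n, n):
        for x in range(0, w - w % n, n):
            tile = img[y:y+n, x:x+n]
            for rep in reps:
                if np.mean((tile - rep) ** 2) < tol:
                    out[y:y+n, x:x+n] = rep  # snap to the cluster's exemplar
                    break
            else:
                reps.append(tile.copy())
    return out, len(reps)
```

A production version would pick each cluster's lowest-error member as the representative rather than the first one seen, but the cleanup effect on repeated glyphs is the same.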
I'm sure there are better ways of doing this too - it's been a decade since I kept up with OCR research.
Those are small optimization problems. Problems of this type are solved in computer vision with hundreds of thousands of variables; his problem can be solved in real time, not tens of seconds.
It still doesn't seem fully accurate, as I can imagine a non-rotated cubic curve with endpoints at an offset, but I assume your simplification works well enough.