The similar-blocks technique used in JBIG2 is what many image codecs call the prediction step, or intra prediction in video. That specific technique is a perfectly fine idea, so I've always wondered why more codecs don't use it - usually they don't let motion vectors point into the current frame.
But usually there's a step after it called residual coding, where you subtract the predicted image from the original and send the difference to make up for errors. Just leaving that out is, um, interesting.
> JBIG2, the image format used in the affected PDFs, has lossless and lossy operation modes. "Pattern Matching & Substitution" (PM&S) is usually the standard operation mode for lossy JBIG2, and "Soft Pattern Matching" (SPM) for lossless JBIG2.
> Both operation modes have the basics in common: images are cut into small segments, which are grouped by similarity. For every group, only a representative segment is saved, which gets reused in place of the other group members - which may cause character substitution. Unlike PM&S, SPM corrects such errors by additionally saving difference images containing the differences between the reused symbols and the original image. At Xerox, the PM&S mode seems to have been used by mistake not only in the "normal" compression mode but also in the "higher" and "high" modes.
In sensible codecs, lossy and lossless modes use the same steps. You then make it lossy by quantizing (rounding) the residual values so they fit in fewer bits. Here they have nothing in between 100% and 0% quality.
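The predict-then-residual pipeline described above can be sketched in a few lines. This is a hypothetical toy, not any real codec's code: `step` stands in for a quantizer, and `step=1` is the lossless case.

```python
import numpy as np

# Toy sketch of the usual lossy pipeline: predict, then code the residual.
# Quantizing the residual is what makes it lossy; a coarser step means fewer
# bits and more loss. JBIG2's PM&S mode skips the residual stage entirely.

def encode(original, predicted, step=4):
    residual = original.astype(int) - predicted.astype(int)
    # Quantization: step=1 keeps the residual exact (lossless round trip).
    return np.round(residual / step).astype(int)

def decode(predicted, q_residual, step=4):
    # Dequantize and add back to the prediction.
    return predicted.astype(int) + q_residual * step

original = np.array([[120, 130], [140, 150]])
predicted = np.array([[118, 133], [139, 155]])  # imperfect prediction

q = encode(original, predicted, step=4)
restored = decode(predicted, q, step=4)
# The quantized round trip is close but not exact...
assert np.abs(restored - original).max() <= 2
# ...while step=1 reconstructs the original perfectly.
assert np.array_equal(decode(predicted, encode(original, predicted, 1), 1),
                      original)
```

The point of the sketch is the knob: quality degrades smoothly with `step`, instead of jumping from "exact" to "whatever segment matched".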
> That specific technique is a perfectly fine idea
I don't know… This whole thing reminds me of "The King's Breakfast" by A. A. Milne. Is it so hard for a scanner to make an exact copy of a piece of paper? It doesn't sound like rocket science, really. I don't want it to try to compress anything, to leave watermarks or whatever. If I feel I need better compression — I'll use a separate tool (which Xerox can provide if it wants to). I don't want my scanner/photocopier to even try modifying the image without my explicit request to do so.
Maybe I don't understand the real technological reasons for doing this… but then I really don't. I can't think of any possible reason for such a step being necessary - except if the device doesn't have enough RAM to store the image as-is, but that really sounds unlikely.
> But usually there's a step after it called residual coding, where you subtract the predicted image from the original and send the difference to make up for errors. Just leaving that out is, um, interesting.
Obviously because it would make the file bigger if that was included...
The concept behind JBIG2 is good - small variations and random pixels are likely to be scanner noise or dust, so suppressing them can reduce file size significantly. The problem here is that some JBIG2 implementations can be too lossy and throw away the few pixels that make all the difference between, e.g., a 6 and an 8.
I have a PDF scan of a textbook that exhibits this problem. It's a complex math book with a lot of superscripts, subscripts, and combinations of them. Some are the wrong symbol - e.g. i instead of j, etc. It was very difficult to study from, not being able to trust any symbol. I put it down to bad editing of the printed original until I found out about this story some time ago.
It's not OCR, but I would call it optical pattern recognition. I think of it as a kind of compression: similar image segments are stored only once in memory, and repeated occurrences are linked to the one stored copy. Unfortunately, identifying similar image segments is not that accurate in these Xerox scanner devices.
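That store-once-and-link scheme can be sketched as follows. This is a deliberately simplified illustration of the PM&S idea, not Xerox's or any JBIG2 library's actual code; the tile sizes, names, and the Hamming-distance matcher are all my own assumptions.

```python
import numpy as np

# Toy Pattern Matching & Substitution: each tile either matches an existing
# dictionary entry (within `threshold` differing pixels) and becomes a pointer
# to it, or starts a new symbol class. No difference image is kept, so a
# near-miss match silently substitutes the wrong symbol.

def pms_compress(tiles, threshold):
    dictionary, pointers = [], []
    for tile in tiles:
        for i, rep in enumerate(dictionary):
            if np.count_nonzero(tile != rep) <= threshold:
                pointers.append(i)              # reuse the representative
                break
        else:
            dictionary.append(tile)             # new symbol class
            pointers.append(len(dictionary) - 1)
    return dictionary, pointers

# Two 3x3 "glyphs" differing by a single pixel -- think 6 vs 8.
six   = np.array([[1, 1, 1], [1, 0, 0], [1, 1, 1]])
eight = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])

# A generous threshold merges them: the decoded page shows the first glyph
# everywhere the second one was scanned.
_, ptrs = pms_compress([six, eight], threshold=1)
assert ptrs == [0, 0]
# Strict matching keeps them apart (but compresses less).
_, ptrs = pms_compress([six, eight], threshold=0)
assert ptrs == [0, 1]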
> Everyone who gets in touch with me requesting help to evaluate their own situation can be certain not to get his identity revealed by me. I acted this way across the entire xerox saga and I won't stop this way of acting now. My contact data can be found in the imprint.
Can the author actually legally make this guarantee?
In that case, no one can make this guarantee. The author's home might be infiltrated, or they might be threatened with death in order to force out the information.
Suppose this were to go to court. If there are multiple interpretations of a phrase, and one interpretation is not realizable (for almost tautological reasons), then the courts are very unlikely to use that interpretation. Instead, they will likely read in implicit qualifiers like "within the limits of what is allowed by the law", "unless believably threatened with the loss of life, limb, home, or similarly serious physical harm", or "following information security practices appropriate for the expected threat model of a civil/economic topic".
Well... this issue isn't exactly unrelated to OCR. OCR compresses images of text by representing them as text, which is more abstract and therefore takes less space to describe. The particular glyphs being recognized are fixed in advance -- platonic a, b, c, etc.
Here, it would be fair to describe what's going on as OCR with the glyphs not being fixed in advance, but rather being discovered on the fly by the algorithm. The entire concept is to identify sections of the image that "show the same thing", and replace the data in those sections with pointers to a single representative patch. That's really not so different from compressing image data that looks suspiciously similar to a capital A down to the one byte 0x41. It's just that different image sections are being Optically Recognized as "similar to each other" rather than "similar to this hardcoded reference glyph".
OCR generally does not work as you describe. The common case is for the OCR system to tag characters in an image so that text may be selected. More advanced systems will generate fonts from the images and replace the text with those. Either way, the text isn't reduced to a single byte.
I've read plenty of kindle books that were clearly the product of OCR. True, "cl" hasn't reduced the image of a lowercase d to a single byte, but that was the intention. Don't confuse OCR, the concept, with OCR-as-implemented-in-a-particular-way, or with a-process-that-we-called-OCR-because-OCR-is-involved-at-some-point. OCR is any system that recognizes sections of image data as matching letter shapes[1].
"Generating a font from the image and replacing the original image data with that" is a very good description of what's going on here.
[1] Or numbers, or symbols like parentheses. The basic concept is letters.
> March 2015: The German Federal Office for Information Security (BSI) bans JBIG2 from being used for archival purposes.