
Xerox scanners/photocopiers randomly alter numbers in scanned documents (2013) - jf
http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning
======
walterbell
This is relatively new:

> March 2015: The German Federal Office for Safety in Information Technology
> bans JBIG2 from being used for archival purposes.

------
astrange
The similar blocks technique used in JBIG2 is what's called the prediction
step in many image codecs, or intra prediction in video. That specific
technique is a perfectly fine idea so I've always wondered why more codecs
don't use it - usually they don't let motion vectors point into the current
frame.

But usually there's a step after it called residual coding, where you subtract
the predicted image from the original and send the difference to make up for
errors. Just leaving that out is, um, interesting.
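
A minimal sketch of that predict-then-residual idea, assuming a toy 1-D row of pixel values rather than a real codec. The names `best_match`, `encode` and `decode` are made up for illustration. Each block is predicted from already-coded pixels in the same row, and the decoder can either add the residual back or skip it:

```python
# Toy illustration only: block prediction from earlier pixels of the same row,
# plus residual coding. All function names are hypothetical.

def best_match(history, block):
    """Offset in `history` whose window is closest to `block` (sum of absolute differences)."""
    n = len(block)
    best_off, best_cost = 0, None
    for off in range(len(history) - n + 1):
        cost = sum(abs(a - b) for a, b in zip(history[off:off + n], block))
        if best_cost is None or cost < best_cost:
            best_off, best_cost = off, cost
    return best_off

def encode(pixels, n):
    """First block is sent raw; later blocks as (offset into earlier pixels, residual)."""
    coded = [("raw", list(pixels[:n]))]
    for start in range(n, len(pixels), n):
        block = pixels[start:start + n]
        off = best_match(pixels[:start], block)
        pred = pixels[off:off + len(block)]
        coded.append((off, [b - p for b, p in zip(block, pred)]))
    return coded

def decode(coded, use_residual=True):
    """Rebuild the row; with use_residual=False the correction step is skipped."""
    out = list(coded[0][1])
    for off, residual in coded[1:]:
        pred = out[off:off + len(residual)]
        if use_residual:
            out.extend(p + r for p, r in zip(pred, residual))
        else:
            out.extend(pred)
    return out

row = [10, 12, 11, 13, 10, 12, 11, 14, 10, 13, 11, 13]
coded = encode(row, 4)
print(decode(coded) == row)                      # residual kept
print(decode(coded, use_residual=False) == row)  # residual dropped
```

With the residual the row comes back exactly. Without it, every block is just a copy of the first one, which looks plausible but is not what was scanned.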

~~~
thaumasiotes
> JBIG2, the image format used in the affected PDFs, usually has lossless and
> lossy operation modes. “Pattern Matching & Substitution” (PM&S) is usually
> the standard operation mode for lossy JBIG2, and “Soft Pattern Matching”
> (SPM) for lossless JBIG2

> Both operation modes have the basics in common: Images are cut into small
> segments, which are grouped by similarity. For every group only a
> representative segment is saved that gets reused instead of other group
> members, which may cause character substitution. Different to PM&S, SPM
> corrects such errors by additionally saving difference images containing the
> differences of the reused symbols in comparison to the original image. At
> Xerox, by error the PM&S mode seems to have been used not only in the
> “normal” compression mode but also in the “higher” and “high” modes.
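
To make the difference between the two modes concrete, here is a rough sketch, not the real JBIG2 algorithm. It assumes segments are tiny flattened 3x5 bitmaps, uses a plain Hamming-distance threshold for "similar", and stores an XOR image as a stand-in for the difference data. The names `compress`, `decompress` and the threshold value are hypothetical:

```python
# Toy model of the two modes described in the quote; not the real codec.
# Segments are flattened 3x5 bitmaps (tuples of 0/1). Names are hypothetical.

def hamming(a, b):
    """Number of pixels in which two equally sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

def compress(segments, threshold, keep_diff):
    dictionary = []  # one representative bitmap per group
    stream = []      # per segment: (dictionary index, XOR difference or None)
    for seg in segments:
        for idx, rep in enumerate(dictionary):
            if hamming(seg, rep) <= threshold:
                break                      # close enough: reuse this representative
        else:
            dictionary.append(seg)         # no match: start a new group
            idx = len(dictionary) - 1
        rep = dictionary[idx]
        diff = tuple(x ^ y for x, y in zip(seg, rep)) if keep_diff else None
        stream.append((idx, diff))
    return dictionary, stream

def decompress(dictionary, stream):
    out = []
    for idx, diff in stream:
        rep = dictionary[idx]
        out.append(rep if diff is None else tuple(x ^ y for x, y in zip(rep, diff)))
    return out

SIX   = (1,1,1, 1,0,0, 1,1,1, 1,0,1, 1,1,1)
EIGHT = (1,1,1, 1,0,1, 1,1,1, 1,0,1, 1,1,1)   # differs from SIX in one pixel
ONE   = (0,1,0, 1,1,0, 0,1,0, 0,1,0, 1,1,1)
page = [SIX, EIGHT, ONE]

substituted = decompress(*compress(page, threshold=1, keep_diff=False))  # PM&S-like
corrected = decompress(*compress(page, threshold=1, keep_diff=True))     # SPM-like
print(substituted == [SIX, SIX, ONE], corrected == page)
```

In the PM&S-like run the 8 is grouped with the 6 and comes back as a clean, sharp 6. In the SPM-like run the stored difference flips the one pixel back.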

~~~
astrange
In sensible codecs, lossy and lossless modes use all the same steps. You then
make it lossy by rounding the residual values, so they fit in fewer bits. Here
they have nothing in between 100% and 0% quality.
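
As a hedged sketch of what "rounding the residual" could look like (a plain uniform quantizer, with the hypothetical names `quantize`, `dequantize` and `reconstruct`; real codecs are more elaborate): a step of 1 is lossless, larger steps trade accuracy for smaller numbers, and a huge step is the same as dropping the residual entirely.

```python
# Illustrative uniform quantizer for residuals; names and values are made up.

def quantize(residual, step):
    """Map each residual to the nearest multiple of `step`, stored as a small integer."""
    return [round(r / step) for r in residual]

def dequantize(levels, step):
    return [level * step for level in levels]

def reconstruct(original, predicted, step):
    """Encode the residual with the given step, then decode it again."""
    residual = [o - p for o, p in zip(original, predicted)]
    approx = dequantize(quantize(residual, step), step)
    return [p + a for p, a in zip(predicted, approx)]

original = [100, 103, 98, 120]
predicted = [101, 101, 101, 101]

for step in (1, 4, 1000):
    decoded = reconstruct(original, predicted, step)
    worst = max(abs(o - d) for o, d in zip(original, decoded))
    print(step, decoded, "max error:", worst)
```

The step size is the quality dial: everything between step 1 (exact) and a very large step (prediction only) is available, which is the middle ground missing here.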

------
gus_massa
This is very interesting and has been resubmitted a few times. Most popular:
[https://news.ycombinator.com/item?id=6156238](https://news.ycombinator.com/item?id=6156238)
(570 points, 655 days ago, 112 comments )

~~~
digital-rubber
Thought I got a deja-vu reading the title. Thanks for confirming my memory
still sort of works :)

------
sydney6
Talk from the 31st Chaos Communication Congress:
[https://media.ccc.de/browse/congress/2014/31c3_-_6558_-_de_-...](https://media.ccc.de/browse/congress/2014/31c3_-_6558_-_de_-
_saal_g_-_201412282300_-_traue_keinem_scan_den_du_nicht_selbst_gefalscht_hast_-
_david_kriesel.html)

The mp4 file has an additional English audio track.

------
Lorento
I have a PDF scan of a textbook that exhibits this problem. It's a complex
math book with a lot of superscripts, subscripts and combinations of them.
Some are the wrong symbol, e.g. i instead of j. It was very difficult to
study from, not being able to trust any symbol. I put it down to bad editing of
the printed original until I found out about this story some time ago.

~~~
userbinator
One of the links in the comments of the original post has a good example of
that:

[http://everist.org/NobLog/20131122_an_actual_knob.htm#jbig2](http://everist.org/NobLog/20131122_an_actual_knob.htm#jbig2)

(JBIG2 discussion near bottom, rest of page is about electronics.)

------
noipv4
It's not OCR, but I would call it Optical Pattern Recognition. I think it's
a kind of compression: similar image segments are stored only once in memory
and repeated occurrences are linked to the one stored copy. Unfortunately,
identifying similar image segments is not that accurate in these Xerox scanner
devices.

------
afarrell
> Everyone who gets in touch with me requesting help to evaluate their own
> situation can be certain not to get his identity revealed by me. I acted
> this way across the entire xerox saga and I won't stop this way of acting
> now. My contact data can be found in the imprint.

Can the author actually legally make this guarantee?

~~~
cfield
What is the legal issue?

~~~
stonogo
Presumably a government agency could require him to disclose this information.

~~~
dalke
In that case, no one can make this guarantee. The author's home might be
infiltrated, or the author might be threatened with death in order to force
out the information.

Suppose this were to go to court. If there are multiple interpretations for a
phrase, and one interpretation is not realizable (due to almost tautological
reasons), then the courts are very unlikely to use that definition. Instead,
they will likely say there were implicit qualifiers like "to within the limits
of what is allowed by the law" and "unless believably threatened with the loss
of life, limb, home, or similar serious physical threat" or "following
information security principles appropriate for the expected threat model of
a civil/economics topic".

------
ris
Yeah, this is two years old.

------
sengork
"OCR rot"

~~~
x3n0ph3n3
If you read the article, you'd know that OCR isn't used here. The issue is way
more interesting than that.

~~~
thaumasiotes
Well... this issue isn't exactly unrelated to OCR. OCR compresses images of
text by representing them as text, which is more abstract and therefore takes
less space to describe. The particular glyphs being recognized are fixed in
advance -- platonic a, b, c, etc.

Here, it would be fair to describe what's going on as OCR with the glyphs not
being fixed in advance, but rather being discovered on the fly by the
algorithm. The entire concept is to identify sections of the image that "show
the same thing", and replace the data in those sections with pointers to a
single representative patch. That's really not so different from compressing
image data that looks suspiciously similar to a capital A down to the one byte
0x41. It's just that different image sections are being Optically Recognized
as "similar to each other" rather than "similar to this hardcoded reference
glyph".
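
A toy contrast of the two cases, assuming 2x2 bitmaps and made-up reference glyphs (nothing here is a real OCR engine or the JBIG2 algorithm; `ocr`, `discover` and the threshold are hypothetical). Both reduce a patch to a pointer. The only difference is whether the reference set is hardcoded or built from the page:

```python
# Toy contrast with hypothetical 2x2 "glyph" bitmaps; not a real OCR engine.

def distance(a, b):
    """Number of differing pixels between two equally sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def nearest(patch, references):
    """Index of the reference bitmap closest to `patch`."""
    return min(range(len(references)), key=lambda i: distance(patch, references[i]))

# OCR-style: reference glyphs are fixed in advance and map to character codes.
FIXED_GLYPHS = [(1, 0, 0, 1), (0, 1, 1, 0)]
FIXED_CODES = [0x41, 0x42]  # pretend these bitmaps are "A" and "B"

def ocr(patches):
    return bytes(FIXED_CODES[nearest(p, FIXED_GLYPHS)] for p in patches)

# Pattern-matching style: references are discovered from the page itself.
def discover(patches, threshold):
    references, pointers = [], []
    for p in patches:
        if references:
            i = nearest(p, references)
            if distance(p, references[i]) <= threshold:
                pointers.append(i)       # reuse an existing representative
                continue
        references.append(p)             # first sighting becomes the representative
        pointers.append(len(references) - 1)
    return references, pointers

patches = [(1, 0, 0, 1), (1, 0, 0, 0), (0, 1, 1, 0)]
print(ocr(patches))
print(discover(patches, threshold=1))
```

In both runs the slightly different second patch collapses onto the first one; OCR calls it 0x41, the pattern matcher calls it "reference 0".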

~~~
jahewson
OCR generally does not work as you describe. The common case is for the OCR
system to tag characters in an image, so that text may be selected. More
advanced systems will generate fonts from the images and replace the text with
those. Either way, the text isn't reduced to a single byte.

~~~
thaumasiotes
I've read plenty of Kindle books that were clearly the product of OCR. True,
"cl" hasn't reduced the image of a lowercase d to a single byte, but that was
the intention. Don't confuse OCR, the concept, with OCR-as-implemented-in-a-
particular-way, or with a-process-that-we-called-OCR-because-OCR-is-involved-
at-some-point. OCR is any system that recognizes sections of image data as
matching letter shapes[1].

"Generating a font from the image and replacing the original image data with
that" is a very good description of what's going on here.

[1] Or numbers, or symbols like parentheses. The basic concept is letters.

