
Xerox scanners and photocopiers randomly alter numbers in scanned documents - sxp
http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning?
======
agl
This class of error is called (by me, at least) a "contoot" because, long ago,
when I was writing the JBIG2 compressor for Google Books PDFs, the first
example was on the contents page of a book. The title, "Contents", was set in
very heavy type which happened to be an unexpected edge case in the classifier
and it matched the "o" with the "e" and "n" and output "Contoots".
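
A toy sketch of how that kind of substitution can happen (hypothetical Python,
not the actual Google Books classifier): if a matcher declares two glyph
bitmaps "the same symbol" whenever few enough pixels differ, heavy type -
where most pixels are ink - can push distinct glyphs under a loose threshold.

    # Hypothetical illustration only: a naive symbol matcher that substitutes
    # one glyph for another when their bitmaps differ in few enough pixels.
    def same_symbol(a, b, threshold=0.1):
        """a, b: equally sized bitmaps given as lists of rows of 0/1 pixels."""
        total = len(a) * len(a[0])
        mismatches = sum(1 for row_a, row_b in zip(a, b)
                           for pa, pb in zip(row_a, row_b) if pa != pb)
        return mismatches / total < threshold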

The classifier was adjusted and these errors mostly went away. It certainly
seems that Xerox have configured things incorrectly here.

Also, with Google Books, we held the hi-res original images. It's not like the
PDF downloads were copies of record. We could also tweak the classification
and regenerate all the PDFs from the originals.

For a scanner, I don't think that symbol compression should be used at all for
this reason. For a single page, JBIG2 generic region encoding is generally
just as good as symbol compression.

More than you want to know about this topic can be found here:
[https://www.imperialviolet.org/binary/google-books-
pdf.pdf](https://www.imperialviolet.org/binary/google-books-pdf.pdf)

~~~
gngeal
It just occurred to me...

 _The title, "Contents", was set in very heavy type which happened to be an
unexpected edge case in the classifier and it matched the "o" with the "e" and
"n" and output "Contoots"._

Wouldn't it be a good idea to perform OCR - using a language model, the works
- before you start classifying the JBIG2 symbols? That way, you'd have
additional contextual information to say "Aha, 'contoots' is probably not what
it reads here" at least in some of the cases.

Although, I realize that on "Google scale", such a complex solution could be a
problem.

~~~
cmarschner
A language model would give you the opposite problem - e.g. you scan a print
of _this_ page containing the word "contoots", which your language model then
corrects to "contents"...

------
linohh
This was predictable. JBIG2 is in no way safe for document processing,
archiving, or anything of the sort. The image is sliced into small areas and a
probabilistic matcher finds other areas that are similar. This way similar
areas only have to be stored once.

Yeah right, you get it, don't you? They are similar, not equal. Whenever
there's a probability less than 1, there's a complementary event with a
probability larger than 0.
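
A minimal sketch of that encode/decode round trip (hypothetical Python
illustrating the idea, not Xerox's implementation): every patch is replaced by
the first dictionary entry it is "similar enough" to, so the decoded page can
show a different glyph than the original - a 6 coming back as an 8, say.

    # Hypothetical sketch: lossy pattern-matching compression by patch reuse.
    def encode(patches, is_similar):
        dictionary, refs = [], []
        for patch in patches:
            for idx, representative in enumerate(dictionary):
                if is_similar(patch, representative):
                    refs.append(idx)              # reuse: original patch is discarded
                    break
            else:
                dictionary.append(patch)          # store a new representative
                refs.append(len(dictionary) - 1)
        return dictionary, refs

    def decode(dictionary, refs):
        # The output contains only representatives, never the original patches.
        return [dictionary[idx] for idx in refs]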

I wonder which prize idiot had the idea of using this algorithm in a copier.
JBIG2 can only be used where mistakes won't mean the world is going to end. A
photocopier is expected to copy. If these machines have been used for digital
document archiving, some companies will face a lot of trouble when the next
tax audit is due.

Digital archives using this kind of lossy compression are not only worthless,
they are dangerous. As the paper trail is usually shredded after successful
redundant storage of the images, there will be no way of determining
correctness of archived data.

This will make lawsuits a lot of fun in the future.

~~~
ams6110
_This will make lawsuits a lot of fun in the future._

Given the way the algorithm works, it would seem to me that "fine print" would
be the most vulnerable to the bug (well not really a bug, it's the behavior of
JBIG2). I wonder if there will be a clear dividing line, e.g. "smaller than
10pt type is subject to reasonable doubt if a Xerox copier was used".

~~~
linohh
The trouble is, there is nothing "reasonable" about the doubt anymore. Copying
and digital archiving both rely on the premise that there is no manipulation.
Lossy compression always seemed to be OK because the image quality was reduced
without changing the integrity and structure of the image. This will
essentially destroy the credibility of digital records. Every shyster and hack
lawyer will pull this as a defense in court.

Also, it's not like there is a reference implementation for encoding JBIG2
that everyone uses. We're talking about proprietary libraries which do the
compression. These libraries are compared using performance indicators like
speed, memory usage, etc. This gives sloppy crap implementations an advantage,
because (and I'd bet on it) when the implementation was chosen, the deciders
didn't even consider that compression could actually alter the document
content. Automated testing of lossy compression is hard, because by design the
output image differs from the input image, so a bit-for-bit comparison proves
nothing. If the comparison is broken, the test will fail to identify errors.
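
A back-of-the-envelope illustration of why a naive page-level comparison
misses this (numbers assumed: an A4 page at 300 dpi and a roughly 30x50 pixel
digit):

    # Swapping a single digit changes only a tiny fraction of the page's
    # pixels, so a global "percentage of pixels changed" check passes easily.
    page_pixels = 2480 * 3508        # A4 at 300 dpi
    digit_pixels = 30 * 50           # one small glyph
    print(f"{digit_pixels / page_pixels:.4%}")   # ~0.0172% of the page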

The critical design failure was assuming that some algorithm performs as well
as or better than the human brain at recognizing low-quality text. This is -
up to now - not the case.

Text won't be a big issue, as mistakes are fairly easy to spot. It's also less
probable to have image fragments that seem similar but really aren't. The
algorithm isn't really smart; it's mostly just pattern matching, due to
performance constraints. Thanks to kerning (variation in the distance between
individual characters), I doubt that swapping of words or sentences will occur
a lot, unless the threshold for reaching significance in the comparison
algorithm is higher than the guy was while designing it.

The real trouble starts when looking at numbers. Numbers are usually typeset
monospaced, right-aligned and/or in tables. The possible variation is pretty
low, and each digit can stand for ten different values. Text documents are
usually scanned at a pretty low resolution, because a human can still
distinguish between characters and digits even when a lot of information is
lost. As already mentioned, algorithms cannot do this.

The next problem is: we can spot mistakes in text because there are syntactic
and semantic rules which we more or less understand. While reading, our
subconscious validates the input, and obvious errors pop out. When it comes to
numbers, there is no such thing. A number cannot be validated without
additional knowledge. And as document processing is labour-intensive work,
mostly done by minimum-wage clerks, there is no way in hell a mistake would be
spotted before the documents are archived for all eternity.

Let's put on the tinfoil hat for a moment: If someone wanted to really fuck up
a company, they could just flash the printer/scanner/copier firmware, changing
parameters of the compression.

~~~
zvrba
> Every shyster and hack lawyer will pull this as defense in court.

I don't think it will be so trivial to use this defense. As somebody claimed,
JBIG2 _reuses_ sufficiently similar blocks, so I guess it can be relatively
easily determined whether the document has been messed up by lossy
compression.

~~~
vidarh
Assuming the document has not since been converted by anything that might have
re-compressed the images.

------
nsxwolf
Truly surprising. I would never have imagined this to be among the problems
one might expect to encounter when scanning or photocopying a document.

It is like taking a picture of my wife with a digital camera and her face
being replaced with that of some other person.

~~~
gcr
That's quite possible.
[http://www.cs.columbia.edu/CAVE/projects/face_replace/](http://www.cs.columbia.edu/CAVE/projects/face_replace/)

I can imagine someone turning the technique into a novel form of image
compression, maybe for surveillance databases or something.

~~~
Gravityloss
Think how someone could falsify your entire life...

~~~
thwest
Then realize you shouldn't define your life based on some digital records.

~~~
delinka
_I_ don't define my life based on some digital records. But law enforcement
(or the executive branch of the US Federal Government, including the NSA)
does. And therein lies the problem.

Someone will be convicted (perhaps even without the intervention of a court)
based on unimpeachable but falsified digital records.

~~~
gridspy
That sounds like the fly in the printer at the start of (the movie) Brazil.

Which of course leads to the conviction and torture of an innocent....

------
ElliotH
I can't quite see the reason why you would lossily compress something when
your machine's purpose is to duplicate things.

Anyone got a reasonable reason for doing this?

~~~
rly_ItsMe
In the good old days of analog copiers this would be impossible: the scanner
sends the light through a system of mirrors to the drum, the drum gets
statically charged, the toner is pulled onto the charged parts and transferred
to the transfer belt, where the paper carries the opposite charge and pulls
the toner off the belt; the paper then goes through the fusing unit, where the
toner is 'burned' onto it. End of story.

On a modern copier, the scanner first transfers the data to RAM and then
usually to a hard disk (most people do not even know that the "copy machine"
has one and saves the scanned material to it). From that hard disk the data is
transmitted via laser to the drum.

Tadaaa - there's your reason for data being compressed on a modern copier.

~~~
blt
Yup, and those old analog copiers - good ones at least - had beautiful crisp
output. The resolution was good enough to reproduce printing dots, so they
could even duplicate photos from books. Continuous-tone analog photographs
didn't reproduce as well. They sure were expensive, though.

------
harrytuttle
This should be on the computer risks digest.

There is virtually no reason whatsoever for this problem to exist. This is the
domain of "making a problem more risky and complicated than it needs to be"
and royally screwing people in the process.

Might as well throw the paperwork in a bin and set fire to it.

~~~
candeira
Sufficiently advanced bugs are indistinguishable from sabotage.

~~~
candeira
And the converse: sufficiently clever sabotage is indistinguishable from a
bug, as evidenced by the "Xerox copier randomly prints penises" prank:
[https://news.ycombinator.com/item?id=6157422](https://news.ycombinator.com/item?id=6157422)

------
lifeformed
Geeze. This could result in some catastrophic errors. An order for 900 servers
instead of 200. $7M loss instead of $1M in your quarterly earnings. Pricing
your product at $3 instead of $8. Makes you realize you need some redundancy
and double-checks for important communications.

~~~
ams6110
Especially considering that faxes, copies, and scans of documents are legally
the same as the originals, at least for ordinary business purposes.

------
scrumper
I don't think it's necessarily an issue of inexcusable incompetence: it seems
like one of those faults which is obvious in retrospect but very difficult to
predict. Why shouldn't Xerox use a standard compression algorithm in their
scanner/copiers? That would seem to be a safer choice than writing a lossy
compression algorithm from scratch. QA testing probably was on the order of
'picture looks right'; after all, why bother testing that the semantics of the
copied content match the original when what you're building is a bitmap
duplicator? (Of course, the OCR stuff would be tested more rigorously, but
this explicitly bypasses that piece). It's not hard to see the chain of
individually reasonable decisions that could lead to something like this.

The real failure is probably something more cultural: there was nobody with
the discipline, experience, and power to write an engineering policy
prohibiting the use of lossy compression in duplication equipment. I have no
idea about Xerox's corporate history, but the evisceration of engineering
departments in US giants and the concomitant decline in what one might call
'standards' or 'rigor' is an established concept.

~~~
rdtsc
> Why shouldn't Xerox use a standard compression algorithm in their
> scanner/copiers?

I had never heard of JBIG2. I implemented JPEG2000 codecs from scratch,
including the arithmetic coding, and I had never heard of JBIG2. And here they
are using it, with others claiming it is just a standard, run-of-the-mill
thing.

> That would seem to be a safer choice than writing a lossy compression
> algorithm from scratch.

Going out on a limb here, wouldn't the safest option be to not use a lossy
codec at all, or to use something like JPEG?

> QA testing probably was on the order of 'picture looks right';

Sorry. This is the company whose name is equivalent to the verb "to copy". If
plugging in an obscure codec from somewhere and checking whether one picture
looks "OK" is their idea of QA, then they deserve all the ridicule and
lawsuits stemming from this.

~~~
jws
JBIG2 is hardly obscure. It is billed just as prominently on the official JPEG
site as JPEG and JPEG2000.

It is useless to someone who wants to compress arbitrary images, since it is
bi-level only; I'd ignore it too if I wanted to compress a photograph. Not
having an open specification hurts: the "last draft" is available, but the
final version was sacrificed to someone's business model.

~~~
rdtsc
You are right; I was just saying that I had been playing with image
compression and hadn't come across JBIG2 - probably because it has a patent
associated with it and it is mainly for bi-level images.

------
micheljansen
Ouch, imagine this happens in a hospital with a prescription or something. It
could really have some serious implications.

~~~
hga
Indeed, I keep a copy of my lab results for the last N years because they
sometimes get lost, once through no real fault of the doctor
([http://en.wikipedia.org/wiki/2011_Joplin_tornado](http://en.wikipedia.org/wiki/2011_Joplin_tornado)).

Grrr, I'm now going to have to view every lab report that's not an original
with suspicion, and make sure my doctors aren't making recommendations due to
screwed up copies.

Lossy compression is _not_ an acceptable default for a general purpose device.

~~~
VMG
This isn't even lossy compression - it's misleading compression

~~~
micheljansen
It's one of the worst examples of "seamless design"[1] I have ever seen.

[1] [http://jim-mcbeath.blogspot.co.uk/2008/11/seamful-design.html](http://jim-mcbeath.blogspot.co.uk/2008/11/seamful-design.html)

------
wahnfrieden
Cached copy, which is missing the updated content:
[http://webcache.googleusercontent.com/search?q=cache%3Awww.d...](http://webcache.googleusercontent.com/search?q=cache%3Awww.dkriesel.com%2Fen%2Fblog%2F2013%2F0802_xerox-workcentres_are_switching_written_numbers_when_scanning%3F&oq=cache%3Awww.dkriesel.com%2Fen%2Fblog%2F2013%2F0802_xerox-workcentres_are_switching_written_numbers_when_scanning%3F&aqs=chrome.0.69i57j69i58.1703j0&sourceid=chrome&ie=UTF-8)

~~~
greenyoda
Thanks. As of now, the cache seems to contain the update. It begins with:

 _Edit: In the last section, it is now sketched what the reasons for the issue
may be, on the basis of several emails I got._

------
model-m
If I were a sentient network and wanted to cause panic among the humans, as a
prelude to full-blown warfare, this is how I'd start. Let's send all those
Xerox copiers to Guantanamo, they are obviously terrorists.

------
D9u
My first thought was, "I wonder if this has anything to do with copy
protections related to anti counterfeiting?"

Not that I have any valid reasons to consider this.

~~~
a3_nm
Agreed. I worried that this might be yet another example of printers and
scanners doing strange things, like
[https://en.wikipedia.org/wiki/Printer_steganography](https://en.wikipedia.org/wiki/Printer_steganography)
or
[https://en.wikipedia.org/wiki/EURion_constellation](https://en.wikipedia.org/wiki/EURion_constellation).
Glad to see that this can be ascribed to incompetence rather than malice.

------
ChuckMcM
Given the challenges of JBIG2, it seems one should be able to construct a
'test' page which, when scanned, will test the algorithm's accuracy.

Once you have that, you can turn it into a sales tool for folks selling multi-
function printers, such that there are "good" printers and "bad" printers, and
then everyone will be forced to pass the test or be labeled a 'bad' printer.
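
One way to start on such a test page (a sketch using Pillow; the page size,
grid layout, digit count and ground-truth file name are all arbitrary
assumptions): print a grid of known digits at a small size, run it through the
device under test, and then compare the scan - by OCR or by eye - against the
saved ground truth.

    # Hypothetical test-page generator: a grid of small random digits with a
    # fixed seed so the expected content is known and reproducible.
    import random
    from PIL import Image, ImageDraw, ImageFont

    random.seed(42)
    page = Image.new("1", (1240, 1754), 1)      # roughly A4 at 150 dpi, white
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()             # deliberately small glyphs
    ground_truth = []
    for row in range(80):
        digits = "".join(random.choice("0123456789") for _ in range(40))
        ground_truth.append(digits)
        draw.text((50, 20 + row * 21), " ".join(digits), font=font, fill=0)
    page.save("jbig2_test_page.png")
    with open("ground_truth.txt", "w") as f:
        f.write("\n".join(ground_truth))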

------
gmac
Wow, how terrifically and fundamentally negligent. Let's hope nobody dies —
the potential hazards seem almost endless.

------
tingletech
Hmm, I use one of these to create PDFs of receipts to attach to my expense
reports.

------
noonespecial
That's one hell of an error. It is literally better for these machines never
to have existed at all.

------
raphman
Just an update: the author states on Twitter that he had already notified
Xerox a week ago [1]. Apparently, Xerox has only now contacted him because
they thought it was a joke [2] ...

[1]
[https://twitter.com/davidkriesel/status/364345036407709697](https://twitter.com/davidkriesel/status/364345036407709697)

[2]
[https://twitter.com/davidkriesel/status/364329334300880896](https://twitter.com/davidkriesel/status/364329334300880896)

------
uptown
Reminded me of this:

"Digital Photocopiers Loaded With Secrets"
[http://www.youtube.com/watch?v=Wa0akU8bsOQ](http://www.youtube.com/watch?v=Wa0akU8bsOQ)

~~~
deletes
Don't all these kinds of machines have a scrub disk option? Or just take the
disk out and scrub it.

~~~
uptown
Possibly. The focus of the story was that (at least at the time) many of the
owners/leasers of these machines had no idea they contained drives that
retained the scanned documents.

------
tudorconstantin
Now that's a bug I wouldn't like being responsible for

~~~
akleen
I don't think the programmer who coded it is to blame. The manager who (very
likely) cut the QA that would have been needed to find it, to save a few
bucks, is.

------
randomfool
This is a massive error - on the order of Intel's FDIV bug.

------
w_t_payne
Wow. I cannot imagine how much chaos this could cause.

------
NamTaf
Do we know the scope of likely affected printers? The company I work at runs a
whole heap (~80) of WorkCentre 3220, 4150 and 4250s, as well as ApeosPorts,
etc.

I shudder to think how much we've scanned that could be affected by this.
Thankfully, I think all of our engineering drawings (which for a decade+ were
printed, signed, then scanned when needed for digital issue) were done on a
non-Xerox device, but all of our standard A3/A4 business stuff is done on
Xerox devices.

------
yew
Minor correction: The article says that the JBIG2 patch size might be the size
of the scanned text. JBIG2 actually has the capability to detect regions of
text and compress them using a specialized technique that operates on
individual symbols.

I suspect Xerox is using this option and their implementation is getting
confused (perhaps by the low resolution). Unless I'm greatly mistaken, the
patch size for normal compression shouldn't figure here.

~~~
ygra
I was confused by that as well. From what I understood about how JBIG2 works,
those symbols don't even have to be the same size everywhere (as would be
quite common with proportional fonts anyway). So there is no "patch size" per
se; it's just the low resolution confusing the classifier.

~~~
linohh
I doubt the patch size is even configurable, as identified patterns can be
scaled accordingly. However the author is not to blame, because JBIG2 is
poorly documented and the implementation of the compressor is not specified in
the standard.

------
Too
Now to the important question: how can I easily verify that my scanner, or
the next scanner I buy, does not have the same issue?

~~~
ygra
Scan in black & white, scan in grayscale, reduce grayscale to bi-level and
compare.
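
A rough sketch of that comparison with Pillow (assuming both scans share the
same resolution and are already well aligned, which in practice needs
registration first; the file names are placeholders):

    # Re-threshold the grayscale scan and count pixels that differ from the
    # scanner's own black & white output.
    from PIL import Image, ImageChops

    bilevel = Image.open("scan_bw.png").convert("1")
    gray = Image.open("scan_gray.png").convert("L")
    rethresholded = gray.point(lambda p: 255 if p > 128 else 0).convert("1")

    diff = ImageChops.difference(bilevel.convert("L"), rethresholded.convert("L"))
    mismatched = sum(1 for p in diff.getdata() if p)
    print(f"{mismatched} differing pixels")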

------
praptak
Your honour, my computer was hacked. Oh, you don't believe that? Well then, I
used a Xerox copier!

------
Canada
That's what you get when you use lossy compression, and it's hardly a problem
unique to Xerox scanners. Maybe important documents should be scanned at a
higher resolution so you don't have problems like this.

~~~
mikeash
Could you share with us a list of other scanners that have this problem, so we
can avoid them?

~~~
WalterBright
I tried various compression and density settings on my Fujitsu scanner, and
didn't see any problems like those mentioned in the article.

~~~
mikeash
Thanks. I scan a ton of documents with my Fujitsu scanner, so that's
particularly relevant to me.

------
tehwalrus
TIFF with lossless compression all the way, followed up with OCR if necessary.
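
For example, with Pillow (a sketch; "scan.png" stands in for whatever the
scanner actually produces):

    # Lossless TIFF: CCITT Group 4 for bi-level scans, LZW otherwise.
    from PIL import Image

    img = Image.open("scan.png")
    img.convert("1").save("scan_bilevel.tiff", compression="group4")
    img.save("scan.tiff", compression="tiff_lzw")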

------
mjlangiii
Has anyone recreated this issue? I haven't been able to.

~~~
mjlangiii
On a Xerox 7535 I was able to recreate the problem when using the example
sheet of numbers provided in the TIFF image.

------
hga
Ack! Looks like overclever compression in a domain where it's not always
desired, let alone required.

I spent half a decade on document imaging in the early-to-mid '90s, a fair
amount of it close to this level (I had a coworker who loved bit-level stuff
for truly evil problems like this), and I can see how it happened ... given
sufficiently careless developers.

