
Xerox responds to the recent character substitution issue - soulclap
http://realbusinessatxerox.blogs.xerox.com/2013/08/06/always-listening-to-our-customers-clarification-on-scanning-issue/
======
soulclap
Previous discussion:
[https://news.ycombinator.com/item?id=6156238](https://news.ycombinator.com/item?id=6156238)

Follow-up blog post about a conference call with Xerox:
[http://www.dkriesel.com/en/blog/2013/0806_conference_call_wi...](http://www.dkriesel.com/en/blog/2013/0806_conference_call_with_xerox)

~~~
jevinskie
Inadvertent downvote, I'm sorry! =(

~~~
quantumpotato_
(OT: how do you downvote on HN?)

~~~
OrsenPike
You need >500 karma

~~~
quantumpotato_
That's amazing. Thanks.

------
eksith
"We do not normally see a character substitution issue with the factory
default settings..."

It shouldn't be seen with _any_ setting. Nothing you can do to the device
(short of involving a hammer) should change the _content_ in any way.
Compress, resize, zoom, do whatever, but it simply must not change the
_content_ at any time at any resolution/quality.

I'm just flabbergasted that such a compression scheme was ever implemented
in the first place. Surely there are alternative OCR-based methods of
compression that don't introduce these artifacts (and that's putting it
mildly) at lower resolutions.

~~~
dman
So you only want non-lossy compression as an option?

~~~
eksith
By "lossy" you mean "17" may look like a crappier "17" with reasonable
confidence, but will never, _ever_ , become "21" at any compression setting,
then I don't mind lossy. That's not asking for too much, is it?

~~~
gwright
But the scanner isn't starting off with "17" (as in two ASCII characters);
it is starting off with a bitmapped image that your brain happens to
interpret as the number 17. It _is_ too much to ask that a lossy image
compression algorithm _always_ result in a compressed bitmap that your
brain interprets exactly as it did the original.

Having various compression/quality options allows you to pick the tradeoff
(file size vs. resulting quality) that is acceptable for your situation.
There is no perfect setting for all situations. Even the original bitmap is
an imperfect (i.e. lossy) rendering of the original document.

~~~
Gormo
It seems a bit too coincidental that images to which human beings assign
semantic value are being transformed into images to which human beings assign
_different_ semantic value.

I don't expect the scanner to have _any_ semantic awareness of the document
content, so when I hear "lossy compression", my expectation is "image may
become illegible", and not "image may remain legible, but become inaccurate".

~~~
jessedhillon
This is Hacker News -- I don't expect everyone to know how JBIG2 or other
compression schemes work. But before you insinuate that the scanner has
semantic awareness of the document and is altering that meaning in a less-
than-coincidental way, I would hope that you could have a cursory look at
how such compression works.

The issue only involves small letters, because the compression scheme
breaks the image up into patches and then tries to identify visually
similar blocks and reuse them. Certain settings can allow for small blocks
of text to be deemed identical, within a threshold, and thus replaced.
That's all. Coincidence, not semantic awareness.

Hence the advisory notice to use a higher resolution: with more pixels per
glyph, fewer genuinely different characters fall within the matching
threshold.
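
To make that concrete, here is a toy sketch of that kind of symbol
matching (my own illustration, not Xerox's actual code -- the glyphs,
threshold, and similarity metric are all made up):

    def similarity(a, b):
        # fraction of pixels two equal-sized 1-bit glyphs agree on
        return sum(pa == pb for pa, pb in zip(a, b)) / len(a)

    def classify(glyph, dictionary, threshold=0.85):
        # return a "close enough" dictionary glyph, else file a new one;
        # later occurrences are rendered from the matched entry
        for entry in dictionary:
            if similarity(glyph, entry) >= threshold:
                return entry  # lossy: the glyph is silently swapped
        dictionary.append(glyph)
        return glyph

    six   = [0,1,1, 1,1,0, 1,1,1]  # pretend low-res "6", 3x3 flattened
    eight = [0,1,1, 1,1,1, 1,1,1]  # pretend low-res "8", one pixel off
    dictionary = []
    classify(six, dictionary)
    print(classify(eight, dictionary) == six)  # True: the "8" became a "6"

No semantics anywhere -- just a similarity test whose threshold is too
forgiving for glyphs that differ by a single stroke.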

~~~
enraged_camel
>>This is hacker news -- I don't expect everyone to know how jbig2 or other
compression scheme works.

As opposed to what, ImageCompression News where you _can_ expect everyone to
know it?

~~~
tbirdz
Or maybe comp.compression

------
binarymax
This is an absolutely hilarious technical response to a real-world customer
issue. The customer does not care one iota that their photocopier uses one
compression algorithm over another. And the fact that there is not one
mention of the word 'copy' in that entire post is very telling of the
technical disconnect exhibited here. The 'Xerox devices' in question are
completely broken from a usability perspective.

~~~
ddunkin
It's no longer a 'copy machine' at this point, it's an 'approximation machine'
and can't even be trusted for legal purposes.

------
wtallis
So they claim that the fine print warns about character substitution. But
they are still willing to label the option with that problem "normal
quality" and to suggest using "high quality" to get strictly image
compression with no OCR. They don't seem to understand that a photocopier,
in its normal operating mode, should never do post-processing that creates
such surprising and misleading artifacts -- better illegible and obviously
so than legible but incorrect.

Don't get me wrong -- using OCR is a great compression technique, but if it
isn't reliable enough, it shouldn't be the default or "normal" setting.

~~~
hyborg787
This has nothing to do with OCR. It's an issue with the JBIG2 compression
re-using similar patches as substitutes for certain areas of the image if
they're "close enough". The issue is exacerbated at lower resolutions.

~~~
ToothlessJake
That's OCR...

~~~
hga
Well, at least in this implementation, it doesn't go all the way to
actually recognizing the symbols it finds (and contrary to Xerox's
statement, we've been told the _compression_ is not standardized). If it
_did_, it would presumably make many fewer of these errors, maybe almost
none, since when it's uncertain it could just go with the original.

~~~
ToothlessJake
O-SubC-R then, perhaps. It is still recognizing shapes/symbols, which is
the very basis of OCR.

This seems a bit hair-splitty when the end result is the same as an
invalid OCR dictionary.

~~~
Groxx
Well sure, but then why don't we just call it "lossy GZIP"? OCR is a
pretty specific subset, and produces _characters_ -- this does not produce
computer-readable characters, therefore it is not OCR.

~~~
ToothlessJake
What are you on about? What does it produce if not computer-readable
characters? Computer-illegible characters? Are you saying it cannot read
from the dictionary it creates? Or from the characters it is later
optically recognizing off that dictionary?

Again, from the JBIG2 wiki[1]:

"Textual regions are compressed as follows: the foreground pixels in the
regions are grouped into symbols. A dictionary of symbols is then created
and encoded..."

It seems that not only is JBIG2 being deployed as OCR by Xerox for
whatever reason, but its implementation in this case is an absolute
failure.

[1] [http://en.wikipedia.org/wiki/JBIG2](http://en.wikipedia.org/wiki/JBIG2)
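
For what it's worth, read literally, that wiki passage boils down to a
data structure like this sketch (the field names are mine, not the
spec's): a dictionary of glyph bitmaps plus placements that reference
entries by index.

    from dataclasses import dataclass, field

    @dataclass
    class TextRegion:
        symbols: list = field(default_factory=list)     # unique glyphs
        placements: list = field(default_factory=list)  # (x, y, index)

        def add_glyph(self, x, y, bitmap):
            if bitmap not in self.symbols:  # exact match only: lossless
                self.symbols.append(bitmap)
            self.placements.append((x, y, self.symbols.index(bitmap)))

Decoding just stamps symbols[i] at each (x, y). Swap the exact-match test
for a fuzzy one and you get the lossy mode -- and a misfiled glyph repeats
its error at every placement that reuses that index.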

~~~
Groxx
Does it produce ASCII? UTF? If not, it's not OCR.

edit: by the definition you seem to be going on, any facial recognition is
also OCR, since you could consider a face a 'glyph' (edit: 'symbol'). The
only 'text' thing here that I can see is that it is _intended_ to be used
on text, which lends itself to some optimizations -- not that it's
actually text-based in any way.

~~~
Dylan16807
If you make a font out of faces and use them as repeated glyphs, then yes,
it's OCR. If you're not using identical symbols over and over, then I
don't think you have a sane definition of 'glyph'.

------
ChuckMcM
That is an astonishing response. It reminds me a bit of the first time EMC
pointed out that while it was possible to have your data corrupted in
their hash-based storage system, it probably would never happen.

I was expecting "Here is new firmware, and we apologize for using JBIG2;
it won't happen again."

One wonders if JBIG2 is used in the storing of checks by banks (my bank
these days only sends me images of my checks, never the actual check
anymore), or in DMV records, or any number of other things.

In the previous thread I suggested a JBIG2 test image; now I want to build
one that, if you copy it, turns from one thing into something else
entirely!
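
A rough way to start on one (my own sketch, nothing proven; it assumes
Pillow, and whether a given copier actually swaps anything depends
entirely on its settings): tile a page with small, easily-confused digits
so a symbol matcher gets hundreds of chances to misfile one as another.

    from PIL import Image, ImageDraw

    page = Image.new("1", (1240, 1754), 1)  # roughly A4 at 150 dpi, 1-bit
    draw = ImageDraw.Draw(page)
    for row in range(40):
        for col in range(30):
            digit = "68"[(row + col) % 2]   # alternate confusable glyphs
            draw.text((40 + col * 40, 40 + row * 40), digit, fill=0)
    page.save("jbig2_testsheet.png")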

------
205guy
This is an interesting story with lots of odd comments.

First and foremost, I agree that Xerox putting their name on a product which
creates an unfaithful copy is corporate suicide. Such an ancient paragon of
computer innovation should be able to come up with a clever algorithm that
compresses but doesn't substitute image bits.

But...

- The original story[1] didn't mention that the product itself warns
against the very thing they are reporting. Did they ignore that warning,
did the copier not show it, or did they use a setting that did not have
the warning? Their later posts cover the issue, so it looks like somebody
else set the resolution and ignored the warning.

- Calling what the JBIG2 algorithm does "OCR" is misleading. OCR is pretty
much understood to mean converting analog text (an image) to digital text
(ASCII, UTF-32). Matching to a real character set and outputting those
characters is a defining part of true OCR. It's also confusing because the
copiers have a true OCR function, and this issue is not related to it.
What JBIG2 does I would call "sub-image matching and substitution."

- Calling JBIG2 "lossy" is also misleading. I suppose it is lossy by
definition, but "lossy" is usually limited to pixel-level effects as seen
in JPG, not substituted image blocks.

- JBIG2 seems like an algorithm that shouldn't be used on low-res text
documents. You might say that's just a configuration of the algorithm, but
if engineers can't take it as a tool and use it correctly, you start to
wonder if it's a problem with the tool.

[1] [http://www.dkriesel.com/en/blog/2013/0802_xerox-
workcentres_...](http://www.dkriesel.com/en/blog/2013/0802_xerox-
workcentres_are_switching_written_numbers_when_scanning)

------
nsxwolf
When you read a scanned or copied document, your confidence in the information
is based on its quality.

There comes a point when the quality is so poor that you no longer trust your
interpretation. Is that a 3? An 8? If you can't tell, you will not act on that
information without further clarification.

This compression algorithm destroys this process.

How can you trust what you are reading anymore? How do we know there isn't a
bug that sometimes causes the content substitution when the source text is
large and perfectly legible?

Disk space is not at enough of a premium to justify this.

------
morsch
I was curious to see how JBIG2 fares compared to JPEG, and found a benchmark
from 2010 [1] comparing the file size of the resulting PDF:

    
    
      convert *.jpg JPEG.pdf                                 # 43777 kb
      convert *.png PNG.pdf                                  #  6907 kb
      jbig2 -b J -d -p -s *.jpg; pdf.py J > JBIG2.pdf        #   947 kb
      jbig2 -b J -d -p -s -2 *.jpg; pdf.py J > 2xJBIG2.pdf   #  1451 kb
    

Quite a difference. I don't quite understand why JPEG fares so poorly
compared to (lossless) PNG -- maybe because it doesn't do monochrome?

[1] [http://ssdigit.nothingisreal.com/2010/03/pdfs-jpeg-vs-png-
vs...](http://ssdigit.nothingisreal.com/2010/03/pdfs-jpeg-vs-png-vs-jbig.html)

~~~
duskwuff
JPEG is optimized for photographic images with lots of smooth gradients. It
does badly with sharp edges, which scanned documents tend to contain a lot of.
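
It's easy to see for yourself (a quick sketch assuming Pillow; the
synthetic strokes below are a crude stand-in for scanned text):

    from PIL import Image, ImageDraw
    import io

    img = Image.new("L", (800, 600), 255)
    draw = ImageDraw.Draw(img)
    for y in range(20, 580, 20):        # rows of sharp black strokes
        for x in range(20, 780, 15):
            draw.line((x, y, x + 8, y + 12), fill=0)

    for fmt in ("JPEG", "PNG"):
        buf = io.BytesIO()
        img.save(buf, fmt)
        print(fmt, len(buf.getvalue()), "bytes")

On an image like this, PNG should come out far smaller, and zooming into
the JPEG output shows the ringing around the edges described above.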

------
eyeareque
Sounds like the "recognized industry standard JBIG2 compressor" is just
about useless for copy machines. Why even give a user the ability to do
this?

The only acceptable fix is to disable the lower compression-quality
settings that could EVER cause this to happen.

~~~
gpvos
JBIG2 is not the problem. It is perfectly possible to do lossless compression
with JBIG2. They just set its options to do some overly aggressive
compression.

~~~
ToothlessJake
I would wager JBIG2 is the problem when Xerox couldn't implement it properly.

"Normal" is an overly aggressive compression setting? Is that an overly
aggressive setting for the end-user or for Xerox to be implementing in their
hardware marketed to law firms?

------
Cyclosa
Xerox is oh-so-subtly shifting the blame onto the user. How slimy.

------
speeder
I am quite disappointed by their response.

I expected something better from Xerox; instead it amounts to: "You are a
stupid customer. Leave it on the defaults and stop bothering me; it is not
my fault you find bugs when not using the defaults."

------
mikeash
Standard idiot-box weasel-wording. Another case study to put on the enormous
pile of examples of how not to communicate with your customers.

Pretend you care, blame the users, and don't take any action. Hey, what could
be wrong with that?

------
mark-r
These are multi-function devices meant to be used by many people. If
someone in your office needs to make occasional scans that have to fit in
an email, isn't it natural to assume they might configure the machine for
maximum compression? Why should that setting affect copies?

------
mathattack
This may be an issue of giving people too much choice. Should users have
the freedom to make terrible mistakes? Maybe in Linux, but not in Windows.
Similarly, you don't want an inexperienced secretary to get your company
into legal trouble. Blaming the users could kill Xerox.

------
emmelaich
Off topic, but it was awesome to see a link with "perl-bin" in it. A nice
insight into what really does the important work in these big shiny
corporations. :-)

------
ARothfusz
Is this a real response? It is bylined as "Guest Blogger" and does not
appear on an official-looking blog.

------
preinheimer
Xerox: You had one job.

------
workbench
"the device web user interface"

Why on earth does a scanner have a web interface?

