Xerox scanners and photocopiers randomly alter numbers in scanned documents (dkriesel.com)
570 points by sxp on Aug 4, 2013 | 112 comments

This class of error is called (by me, at least) a "contoot" because, long ago, when I was writing the JBIG2 compressor for Google Books PDFs, the first example was on the contents page of a book. The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".

The classifier was adjusted and these errors mostly went away. It certainly seems that Xerox have configured things incorrectly here.

Also, with Google Books, we held the hi-res original images. It's not like the PDF downloads were copies of record. We could also tweak the classification and regenerate all the PDFs from the originals.

For a scanner, I don't think that symbol compression should be used at all for this reason. For a single page, JBIG2 generic region encoding is generally just as good as symbol compression.

More than you want to know about this topic can be found here: https://www.imperialviolet.org/binary/google-books-pdf.pdf

How would one handle the case with the tiny boxes? It seems to me that these ought to be treated more like line drawings and not unified as symbols at all if you can't properly decompose them into lines of Latin alphabet glyphs. JBIG2 of course cleverly doesn't tell you how to do the "smart" segmentation...

Yeah, and because the libraries are not open source, we'll never be able to check who failed big time.

Actually, that doesn't matter all that much. You ought to scan it into a TIFF file and then process it the way you want. If you want a JBIG2 compressor that works to your liking, you have to write it yourself anyway; I don't think the printer hardware and software are up to that task.

The idea is actually very smart: given the infinite (and multidimensional) space of encoder solutions, fixing the bit encoding and the decompression process was very smart. It's like with PDF: it's well defined how to draw it into a bitmap but you're not constrained as to how you generate the layout, what line break algorithm you use etc.

It just occurred to me...

> The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".

Wouldn't it be a good idea to perform OCR - using a language model, the works - before you start classifying the JBIG2 symbols? That way, you'd have additional contextual information to say "Aha, 'contoots' is probably not what it reads here" at least in some of the cases.

Although, I realize that on "Google scale", such a complex solution could be a problem.

Language model would give you the opposite problem - eg you scan a print of _this_ page containing the word "contoots" which your language model corrects to "contents"...

JBIG2 [91] suggests using OCR to verify that you didn't mangle anything. If the compressed result has a lower success rate in matching words than the original, then you did something wrong.

[91] http://jbig2.com/
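A minimal sketch of that verification idea, assuming the OCR step has already been run on both the original scan and the decompressed result and has produced word lists. The function names, the vocabulary and the tolerance parameter are all made up for illustration; this is not taken from any JBIG2 tool.

```python
def hit_rate(words, vocabulary):
    """Fraction of OCR'd words that appear in the vocabulary."""
    if not words:
        return 1.0
    return sum(w.lower() in vocabulary for w in words) / len(words)


def compression_looks_safe(original_words, compressed_words, vocabulary, tolerance=0.0):
    """Flag the page if the compressed image matches fewer known words than the original."""
    before = hit_rate(original_words, vocabulary)
    after = hit_rate(compressed_words, vocabulary)
    return after >= before - tolerance


# Hypothetical OCR output for the "Contents" page mentioned earlier in the thread.
vocabulary = {"contents", "chapter", "one"}
original = ["Contents", "Chapter", "One"]
compressed = ["Contoots", "Chapter", "One"]

print(compression_looks_safe(original, original, vocabulary))
print(compression_looks_safe(original, compressed, vocabulary))
```

Because the check compares the two hit rates rather than correcting anything, a page that really does say "contoots" scores the same before and after. It would still say nothing about digits, since there is no vocabulary for numbers.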

This was predictable. JBIG2 is in no way secure for document processing, archiving or whatsoever. The image is sliced into small areas and a probabilistic matcher finds other areas that are similar. This way similar areas only have to be stored once.

Yeah right, you get it, don't you? They are similar, not equal. Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0.
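To make the "similar, not equal" point concrete, here is a toy sketch of threshold-based symbol matching. It is not how any real JBIG2 encoder is written: the 5x6 glyphs, the pixel-count distance and the `max_diff` parameter are assumptions chosen only to show the failure mode.

```python
# Two tiny bilevel glyphs that differ in only two pixels.
GLYPH_6 = [
    "01110",
    "10000",
    "11110",
    "10001",
    "10001",
    "01110",
]
GLYPH_8 = [
    "01110",
    "10001",
    "01110",
    "10001",
    "10001",
    "01110",
]


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, max_diff):
    """Build a symbol dictionary; a glyph reuses the first entry within max_diff pixels."""
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if hamming(glyph, known) <= max_diff:
                indices.append(i)  # reuse an existing symbol
                break
        else:
            dictionary.append(glyph)  # no close match: store a new symbol
            indices.append(len(dictionary) - 1)
    return dictionary, indices


page = [GLYPH_6, GLYPH_8]
print(encode(page, max_diff=0)[1])  # exact matching keeps the two digits apart
print(encode(page, max_diff=3)[1])  # a loose threshold stores the 8 as a second 6
```

With the loose threshold the decoder draws the first glyph in both positions, so the output looks perfectly crisp and is wrong.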

I wonder which prize idiot had the idea of using this algorithm in a copier. JBIG2 can only be used where mistakes won't mean the world is going to end. A photocopier is expected to copy. If the machines were used for digital document archiving, some companies will face a lot of trouble when the next tax audit is due.

Digital archives using this kind of lossy compression are not only worthless, they are dangerous. As the paper trail is usually shredded after successful redundant storage of the images, there will be no way of determining correctness of archived data.

This will make lawsuits a lot of fun in the future.

Thinking about how often I use scan to PDF and e-mail with important documents, this article gives me the shivers. This is an epic fuck-up. Nothing less than grossly negligent.

> This will make lawsuits a lot of fun in the future.

Given the way the algorithm works, it would seem to me that "fine print" would be the most vulnerable to the bug (well not really a bug, it's the behavior of JBIG2). I wonder if there will be a clear dividing line, e.g. "smaller than 10pt type is subject to reasonable doubt if a Xerox copier was used"

The trouble is, there is no "reasonable" in doubt anymore. Copying and digital archiving both rely on the premise that there is no manipulation. Lossy compression always seemed to be OK because the image quality was reduced without changing the integrity and structure of the image. This will essentially destroy the credibility of digital records. Every shyster and hack lawyer will pull this as defense in court.

Also it's not like there is a reference implementation for encoding JBIG2 everyone uses. We're talking about proprietary libraries which do the compression. These libraries are compared using performance indicators like speed, memory usage, etc. This gives sloppy crap implementations an advantage, because (and I'd bet on that) when the implementation was chosen, the deciders didn't even have the idea that a compression could actually manipulate the document content. Automated testing of compression algorithms is hard, because by design there can never be 100% proof, as the output image is different from the input image. If the comparison is broken, the test will fail to identify errors.

The critical failure in the design was thinking that some sort of algorithm performs equally or better than the human brain at recognizing text in low quality. This is - up to now - not the case.

Text won't be a big issue, as mistakes are fairly easy to spot. It's also less probable to have image fragments that seem similar but really aren't. The algorithm isn't really smart; it's mostly just pattern matching due to performance constraints. Thanks to kerning (variation in distance between individual characters), I doubt that swapping of words or sentences will occur a lot, unless the threshold for reaching significance in the comparing algorithm is higher than the guy was while designing it.

The real trouble starts when looking at numbers. Numbers are usually typeset monospaced, right aligned and/or in tables. The possible variations are pretty low, each digit represents 10 different possible meanings. Text documents are usually scanned at a pretty low resolution, because for a human it's still possible to distinguish between characters and numbers, even when a lot of information is lost. As already mentioned, algorithms cannot do this.

The next problem is: we can spot mistakes in text because there are syntactic and semantic rules which we more or less understand. While reading, our subconscious validates the input, and obvious errors pop out. When it comes to numbers, there is no such thing. A number cannot be validated without additional knowledge. And as document processing is one of the labour-intensive tasks, mostly executed by minimum wage clerks, there is no way in hell a mistake would be spotted before the documents are archived for all eternity.

Let's put on the tinfoil hat for a moment: If someone wanted to really fuck up a company, they could just flash the printer/scanner/copier firmware, changing parameters of the compression.

> Every shyster and hack lawyer will pull this as defense in court.

I don't think it will be so trivial to use this defense. As somebody claimed, JBIG2 _reuses_ sufficiently similar blocks, so I guess it can be relatively easily determined whether the document has been messed up by lossy compression.
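A rough sketch of how such a check might look, assuming the page has already been cut into glyph-sized patches (that segmentation step is not shown, and the function name is made up). The underlying assumption is that in an untouched scan, sensor noise makes pixel-for-pixel identical patches rare, whereas symbol reuse produces many of them.

```python
from collections import Counter


def duplicate_fraction(patches):
    """Fraction of patches that are pixel-for-pixel identical to at least one other patch."""
    if not patches:
        return 0.0
    counts = Counter(tuple(p) for p in patches)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(patches)


# Each patch is a list of row strings; these three are purely illustrative.
a = ["010", "101", "010"]
b = ["010", "111", "010"]
print(duplicate_fraction([a, list(a), b]))
```

A high fraction would only be a hint that symbol substitution was used, not proof that any particular character is wrong.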

Assuming the document has not since been converted by anything that might have re-compressed the images.

"The image is sliced into small areas and a probabilistic matcher finds other areas that are similar."

"Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0."

If that alone is reason for why JBIG2 is in no way secure for document processing, archiving or whatsoever - then I've got some bad news for you. Because if that's the case you really shouldn't be using a computer for, well, anything.

Truly surprising. I would never have imagined this to be in the domain of possible problems one would expect to encounter scanning or photocopying a document.

It is like taking a picture of my wife with a digital camera and her face being replaced with that of some other person.

That's quite possible. http://www.cs.columbia.edu/CAVE/projects/face_replace/

I can imagine someone turning the technique into a novel form of image compression, maybe for surveillance databases or something.

A very tech savvy friend bought a camera in Japan and after about a week or so started delving into the settings. He thought all the faces looked wrong. He found a setting that made the eyes bigger and rounder. It was subtle, but quite funny at the same time.

Most of the consumer compact cameras have a "purikura" setting or a "beauty" setting with special treatment for the skin, whiter eyes and whiter teeth, and eventually bigger eyes and smaller mouth (yes, that's a thing).

It may be on by default for the cameras targeted at a female audience (in a rapidly shrinking market, female bloggers for instance are a big target), otherwise it won't even be available in more specialized or "hardcore" markets, like DSLR or mirrorless (4/3rds, Nikon 1, EOS M etc) for instance. For the anecdote, I bought a shockproof/waterproof compact camera last year and there's nothing so fancy on it.

Think how someone could falsify your entire life...

Then realize you shouldn't define your life based on some digital records.

I don't define my life based on some digital records. But law enforcement (or the executive branch of the US Federal Government, including the NSA) does. And therein lies the problem.

Someone will be convicted (perhaps even without the intervention of a court) based on unimpeachable but falsified digital records.

That sounds like the fly in the printer at the start of (the movie) Brazil.

Which of course leads to the conviction and torture of an innocent....

The fact that this conversation is taking place makes it unlikely for those records to be regarded as "unimpeachable".

Accept it as an inevitability and develop skills that enable you to adapt, react & procreate in a multitude of diverse situations.

With personal video recording (a la Google Glass and friends) it won't be long before we're subjected to this sort of thing. It's amazing how close we're getting to Ghost in the Shell and I'm sure it won't be long when live video feeds can be hacked in real time to show something contrary to what's actually happening.

> and I'm sure it won't be long when live video feeds can be hacked in real time to show something contrary to what's actually happening

Done! Actually a few years ago. Even though I suspect that you have something more complex in mind than splicing in a TV signal:

https://vimeo.com/29279198 real time face substitution

I can't quite see the reason why you would lossily compress something when your machine's purpose is to duplicate things.

Anyone got a reasonable reason for doing this?

Good point. Looking at a product page (http://www.office.xerox.com/multifunction-printer/color-mult...), I see that the first model mentioned is multifunction, it can "Copy, email, fax, print, [and] scan".

So it sounds like there's one code path and it's seriously broken. I looked at the first settings page, and while it's in German I can see it's 200 DPI. There's no excuse for default lossy compression when you're at 200 DPI and doing office-sized paper. We didn't do that in 1991: we got CCITT Group 4 lossless compression of around 50KB per image, plus or more generally minus, for 8.5x11 inch paper, although we did do things like noise reduction and straightening documents (that makes them compress better, among other things).
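For scale, a back-of-the-envelope sketch of what those figures imply. The function name is made up for illustration, 50KB is taken as 50,000 bytes, and row padding is ignored; the page size and DPI are the ones quoted above.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel page, ignoring any row padding."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return (width_px * height_px + 7) // 8


raw = raw_bilevel_bytes(8.5, 11, 200)  # 1700 x 2200 pixels
ratio = raw / 50_000                   # against the ~50KB Group 4 figure
print(raw, ratio)
```

That is a little under half a megabyte raw, so the lossless figure quoted works out to a compression ratio of roughly 9:1.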

CCITT Group 4, also known as Modified Modified READ (MMR), is in no way something you'd want to use now. Here's why MMR sucks hard:

1. It's monochrome. No greyscale, no color. This works for text and lines, but nothing else. No big surprise, it was designed for fax. But this makes CCITT G3 and G4 effectively lossy for anything that isn't already black and white.

2. It has no defined endianness. This adds another fault risk which you won't see coming as long as you're working on an isolated platform but can hit you in the nuts when you change hardware or software.

3. The data does not contain resolution or dimensional information, nor any information about endianness. This means that you have to rely on a container to provide this information. It could be TIFF, it could be PDF, it could be something an intern coded during coffee break. This is good on one hand, but evil on the other. Software is sold saying CCITT G4 compression (a standard, after all) is used, while the data can be embedded in proprietary containers.

4. It's a 2D compression, meaning the compression is applied on a matrix of binary pixel data. As the standard does not specify the dimensions, you depend on another image container like TIFF to provide information. Because G4 removed EOL markers, there is no way to reconstruct image dimensions from the compressed data alone.

5. It's not exactly fault tolerant. Transmission errors can influence larger areas of the image up to making the picture totally unreadable. Flipped bits are not too critical, missing bits are, due to the 2D compression.
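To illustrate point 2: the same compressed bytes describe different bit streams depending on whether bits are read most-significant-first or least-significant-first within each byte, and nothing in the data says which was meant. A minimal sketch, with made-up function names; TIFF, for instance, leaves this to its FillOrder tag.

```python
def reverse_bits(byte):
    """Mirror the 8 bits of one byte (MSB-first <-> LSB-first)."""
    return int(f"{byte:08b}"[::-1], 2)


def swap_fill_order(data):
    """Re-express a byte string in the opposite bit order."""
    return bytes(reverse_bits(b) for b in data)


sample = bytes([0x13, 0x80])
swapped = swap_fill_order(sample)
print(sample.hex(), swapped.hex())
```

A decoder that guesses the wrong order is handed a completely different code stream, which is why the container has to record it.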

There are many excellent, fault-tolerant, standardized image formats ready to use for document processing and archiving. CCITT G4 isn't exactly one of them.

Errr, I didn't communicate clearly.

What I meant to say was that CCITT Group IV gave acceptable sizes for early-'90s computing power, CPU and disk, and something at or better than its level of lossless compression today should be even more acceptable.

And in light of this screwup, I suspect we'd agree that Xerox would have been better off to use lossless (well, after the scanning, as you point out, but then again no one was willing to pay for color) CCITT Group IV than overly clever lossy JBIG2.

"It could be TIFF, it could be PDF, it could be something an intern coded during coffee break."

It could be something a journeyman software engineer edging toward expert coded on a Saturday afternoon in a very fast paced project; for me, 3 weeks on the "engine". And, oh my, I can't remember encoding endianness, except of course for the leading TIFF bytes. But I had a guy who knew this cold telling me what to do; he was the one who debugged all our raw compressed data problems bit by bit. And, yeah, it was an "Intel" little endian TIFF, and I think I recall the Kodak Powescans produced that (600 pound monsters that could scan 18 inches per second at 200 DPI).

Hmmm, at least back then, "TIFF" was the selling point, and, oh yeah, it's Group IV compressed (except of course when it wasn't, we once dealt with some weird enhanced Group III).

Of course they would have been better off with T.6, as Group 4 at least did not modify the image content. However, especially with TIFF, there are/were countless implementations of viewers, components and libraries, and every single one of them had its own habits. Some would not regard endianness, some would assume payload endianness is the same as the TIFF, some did respect the tag for byte order specific to the image. When I coded my first TIFF library, I was around 14 and the most troublesome part of doing it was keeping myself from bashing my head against the next available wall due to the stupidity of other people who thought interpreting a standard according to their wishes was OK, because there'd never be someone trying to display the images with a viewer different from theirs.

I don't know how deep you have dived into TIFF, but maybe you remember the TIFF6 Standard way of embedding JPEG. It was the biggest pain in the ass imaginable, having to parse JPEG files, splitting them and packaging it into different TIFF Tags. Before TTN2 and easy embedding of JPEG Images, everyone invented their own way of avoiding the standard. Some defined their own compression type, some used the standard compression type, but used it in a nonstandard way, ah, I'm starting to lose my hair again ;-)

Not that deeply, I only did B&W document imaging, and I think the last time I worked on TIFF headers and tags was in 1992, so it was almost certainly the 5.0 standard, 6.0 came out in that year.

And yeah, it was a mess; we mostly did the best we could and made sure the ones we generated worked for our customer's reader(s). Although I don't remember any big problems with people reading the ones we produced.

In the good old days of analog copiers this would have been impossible: the scanner sends the light through a system of mirrors to the drum, the drum gets statically charged, the toner is pulled onto the charged parts and gets transferred to the transfer belt, where the paper has the opposite charge and pulls the toner off the transfer belt, then goes through the fusing unit, where the toner is 'burned' onto the paper. End of story.

On a modern copier the scanner transfers the data first to RAM and then usually to a hard disk (most people do not even know that the "copy machine" has one and saves the scanned stuff to it). From that hard disk the data is transmitted via laser to the drum.

Tadaaa - you have the reason for having data be compressed on a modern copier.

Yup, and those old analog copiers - good ones at least - had beautiful crisp output. The resolution was good enough to reproduce printing dots so they could even duplicate photos from books. Continuous tone of an analog photograph didn't work as well. They sure were expensive though.

I might have missed something, but my reading is that the article doesn't state or imply this happens with regular photocopies, only with scans to PDF.

Others have pointed out a credible explanation: to have the document take less space on their hard disk.

However, it does not have to be compression, per se. Modern copiers want to correct all kinds of errors such as creases and staples. They also want to optimize the colors. To do that, they have logic for detecting what areas of the page are full-color and which are black and white, which are half-tone printed, which are text, line art, photograph, whether the paper might have aged, etc.

I don't know what tricks they use, but I do not rule out that they will replace 'looks somewhat dirty' patches with an 'obviously higher quality version' of them, and use too aggressive parameters in some of those heuristics.

Well, we have 14TiB of financial documents archived on our kit. There is no way we would even consider such compression!!!

The whole thing is dangerous and wholly illogical.

This is akin to a crappy crime flick where someone hits the "enhance!" button on a CCTV still a few times and gets to see the dirt on the guy's teeth.

In this case, the computer decides the guy is female and has no teeth.

IIRC, when security cameras moved from per-frame compression algorithms like M-JPEG to modern codecs, which can sometimes replace small movements in the background with a still image when there is a bigger change in the foreground, there were news reports about problems with investigations.

If you're scanning a long document to a PDF, compression makes a lot of sense. It's the difference between being able to email the PDF as an attachment and having to find a place to put the file online.

Exactly, and that is why there should be a compression step on the code path that handles the paper -> pdf case. This doesn't make any sense in a paper -> paper case, however, as any electronic version of the image will only be stored internally, for a very brief time.

It is amazing the things people expect to be emailed. Can you email me the MRI scan? It's over 1500 images. Can you email it?

It's amazing to me that engineers who refuse to update their worldviews about normal people's mental models for "sending data" still get amazed by this.

The size of the files or the number of them are totally irrelevant.

Normal people seem to get that it's considerably harder to ship a barn than a letter, and that if you want to move a barn you use a specialty service rather than the post office.

The size and number of files are and should be totally relevant even to "normal" people. When someone asks for something in e-mail, it's perfectly reasonable to say "no, it's much too big" and expect them to understand.

But we're not dealing with barns or letters or any physical object, we're dealing with abstract systems where the physics are much more flexible and changeable. It's important to change our computer systems to work for us, rather than attempting to change people to adapt to the computer systems. We should discard those systems that can not adapt to humanity, as they are of little worth in the long run.

When somebody says, "can you email it to me" they mean, grant access to the data via their centralized messaging system, their email. There are many ways to make that happen, one of which is an attachment, another of which is linking to the content, but the key is to make sure that it's low friction and takes very little time or clicks to get access from the email.

There are some pieces of content for which it's entirely impractical to "grant access via e-mail". On occasion people ask to be e-mailed extremely large blocks of data, where it would literally be faster to burn it to a pile of DVDs and then FedEx them than to upload-and-then-download the data. Depending on the size of the medical images mentioned in a previous post, that might actually be the case in that circumstance.
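The comparison is easy to put numbers on. A quick sketch, where the link speeds, the file sizes and the 24-hour courier time are illustrative assumptions rather than figures from anyone in this thread:

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


def courier_wins(size_bytes, link_bits_per_second, courier_hours=24):
    """True if shipping physical media beats sending over the link."""
    return transfer_seconds(size_bytes, link_bits_per_second) > courier_hours * 3600


# A 1.5 GB study over a 1 Mbit/s uplink: a few hours, so the link still wins.
print(transfer_seconds(1.5e9, 1e6) / 3600, courier_wins(1.5e9, 1e6))

# 500 GB over a 10 Mbit/s uplink: several days, so the courier wins.
print(transfer_seconds(500e9, 10e6) / 3600, courier_wins(500e9, 10e6))
```

If the data has to be uploaded and then downloaded again, the slower of the two links sets the pace, which only makes the courier look better.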

It's a failure of technology when it's difficult to send ordinary-sized files like a few photos or a couple pages of documents. But it's a failure of people when they don't recognize the possibility that some types of data (video, large numbers of images, scientific research data, whole databases) simply can't be sent quickly, yet they fail to plan ahead to gain access. (I've also entirely skirted the issue of "some data should have its access restricted physically"...)

> But it's a failure of people when they don't recognize the possibility that some types of data (video, large numbers of images, scientific research data, whole databases) simply can't be sent quickly, yet they fail to plan ahead to gain access. (I've also entirely skirted the issue of "some data should have its access restricted physically"...)

I disagree vehemently with that attitude, and I have to deal with it every day. In my field, >50% of the data we receive is transferred by overnight courier of hard drives due to the quantity of data. It's a crappy attitude to blame people for having to learn that, and in an ideal world we'd share it via access granted by email. People should not be blamed for not understanding that; our infrastructure should be blamed for not supporting 10Gb everywhere, and cheap access to 40Gb+ on long-distance connections.

Nothing is helped by blaming people, and relationships can be harmed by doing that. But we can change the technology.

As a side note, DVDs? Really? They're incredibly slow at data transfer once you have them in hand, the tiny size of a DVD requires tricky archive spanning methods, and optical discs are flaky technology all around. Hard drives or LTO-5/6 all the way.

I think most people would understand if you asked them how long it would take to download a million large photos, given that one large photo often takes several seconds to complete. They'd realize that this might be a slow process.

Incidentally, it's not just about transfer speed. Sometimes people ask if you can e-mail something that has never been put on a computer, and would take weeks or months to scan in. Or sometimes they ask for access to information when access is very slow to set up due to security or privacy considerations. Or sometimes they ask for access to something that the boss needs to physically sign off on, after the boss has gone home for the day. This is only a problem if they've decided it's urgent to have it, and simply haven't thought ahead about how it might not necessarily be possible to get instant access to every piece of information that ever existed.

We can change technology. But we also need to retain the mindset of arranging access beforehand. It's not about "blaming", it's simply about encouraging people to understand what they're asking for and to make sure they get the access they need before they need it.

[As an aside, DVDs are just an example of "sometimes it's really freaking slow to download data" that somebody like my mom would get. An alternative way to phrase it would be "downloading that would be so slow, it'd be better to just have your friend bring her laptop over." I certainly don't intend to suggest a new industry standard.]

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway

Andrew Tanenbaum

It would be helpful if file managers gave better cues as to file size. A barn is obviously different to a letter, but a one byte file is normally given the same icon as a one terabyte file.

I think we can all agree that e-mails should have finite size - it's not a very good protocol for transferring multi-gigabyte files, for sure! Where we would disagree is where that limit should be drawn.

I've seen systems in this day and age that fail in the face of e-mails as small as 5 megabytes (e.g. Yahoo Popgate) which IMHO is far too low - but evidently some sysadmins disagree with me!

Email size is a technical issue that shouldn't be limiting (or even visible) to the end user. If an end user wants to send a multi-gigabyte file to another user's email address, why not? The email client could launch a background upload process and email a link to get that file by, say, BitTorrent... Some protocol extensions and software support would be needed, but that can be done and, as users need it, probably should be done.

Oh, the protocols are already in place; consider RFC2017.

Sending a link to something (even wrapped in a nice ui and container) has pretty different semantics from actually sending the something, though.

Opinions run the gamut. I'm firmly in the "email should be plain text" camp but realized that battle was lost long ago.

Just to make it clear, I'm not an engineer (I'd like to be good enough to be considered one, but that's a long way off). I'm more of an enthusiastic amateur who knows enough to badly break things. And the people who ask most frequently are ortho surgeons with a patient asleep on the table. Anyone who waits until that late in the piece and then asks for a 1.5 gig email because they weren't organised enough to sort out access to images in the 3 month lead-up to an operation is not a normal person.

We developed an easy way to email and collaboratively view CT or MRI studies. http://www.claripacs.com.

Thanks, I'll be looking into that.

Emailing the file or emailing a link to the file is just as useful. As long as you have an encrypted document sharing capability you should be able to say sure to just about anything.

PS: Granted, that's assuming fast networks. For 80+ GB VMs, sending a removable drive is often faster.

For a lot of end users, if it's not a proper attachment no end of grief is caused.

And encryption and file hosting causes more hassle in enterprise environments with unforgiving compliance policies.

The article has been updated with the probable cause for this error.

Cheaper components, maybe? (If it lets them get by with less memory for example.)

This should be on the computer risks digest.

There is virtually no reason whatsoever for this problem to exist. This is the domain of "making a problem more risky and complicated than it needs to be" and royally screwing people in the process.

Might as well throw the paperwork in a bin and set fire to it.

Sufficiently advanced bugs are indistinguishable from sabotage.

And the converse: sufficiently clever sabotage is indistinguishable from a bug, as evidenced by the "Xerox copier randomly prints penises" prank: https://news.ycombinator.com/item?id=6157422

Geeze. This could result in some catastrophic errors. An order for 900 servers instead of 200. $7M loss instead of $1M in your quarterly earnings. Pricing your product at $3 instead of $8. Makes you realize you need some redundancy and double-checks for important communications.

Especially considering that faxes, copies, and scans of documents are legally the same as the originals, at least for ordinary business purposes.

I don't think it's necessarily an issue of inexcusable incompetence: it seems like one of those faults which is obvious in retrospect but very difficult to predict. Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers? That would seem to be a safer choice than writing a lossy compression algorithm from scratch. QA testing probably was on the order of 'picture looks right'; after all, why bother testing that the semantics of the copied content match the original when what you're building is a bitmap duplicator? (Of course, the OCR stuff would be tested more rigorously, but this explicitly bypasses that piece). It's not hard to see the chain of individually reasonable decisions that could lead to something like this.

The real failure is probably something more cultural: there was nobody with the discipline, experience, and power to write an engineering policy prohibiting the use of lossy compression in duplication equipment. I have no idea about Xerox's corporate history, but the evisceration of engineering departments in US giants and the concomitant decline in what one might call 'standards' or 'rigor' is an established concept.

> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?

I have never heard of JBIG2. I implemented JPEG2000 codecs from scratch, arithmetic coding compression, and I have never heard of JBIG2. And here they are using it, and others are claiming it is just a standard run-of-the-mill thing.

> That would seem to be a safer choice than writing a lossy compression algorithm from scratch.

Going out on a limb here, wouldn't the safest be to just not use a lossy codec at all or use something like JPEG?

> QA testing probably was on the order of 'picture looks right';

Sorry. This is the company whose name is the equivalent to the verb "to copy". If plugging in an obscure codec from some place and checking if one picture looks "OK" is their idea of QA then they deserve all the ridicule and lawsuits stemming from this.

JBIG2 is hardly obscure. It is billed just as prominently on the official JPEG site as JPEG and JPEG2000.

It is useless to someone who wants to compress arbitrary images, since it is bi-level only; I'd ignore it too if I wanted to compress a photograph. Not having an open specification hurts. The "last draft" is available, but the final was sacrificed to someone's business model.

You are right, I was just saying I was playing with image compression and just hadn't found JBIG2. Also probably because it has a patent associated with it and it is mainly for bi-level images.

> Sorry. This is the company whose name is the equivalent to the verb "to copy". If plugging in an obscure codec from some place and checking if one picture looks "OK" is their idea of QA then they deserve all the ridicule and lawsuits stemming from this.

You need to put your corporate drone hat on. How many people are involved in making a Xerox copier? How many parts are reused from the previous model? How much software is reused?

My best guess is that a large number of components in a copier are engineered in isolation. The image compression people responsible for implementing JBIG2 probably don't even care about correctness beyond some threshold ("not my problem"). The people responsible for ensuring correct copying may not even know that an image compression step exists, and even if they do, they may not understand the technical nuances of JBIG2 or have the right documents to find an instance of such a problem.

> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?

The problem isn't using a standard compression algorithm. It's failing to consider the properties of the algorithm used in relation to the problem domain.

A classic mistake engineering students make is to try and use familiar equations anywhere that the units work out. As a result, engineering professors hammer in the idea that before using any equation, you have to ask yourself: what are the assumptions underlying this equation, and do those assumptions hold for my specific problem? Similarly, if you're writing software for copiers, you should ask the basic question of whether a particular compression algorithm was appropriate for the particular types of images being compressed. It's incredibly basic.

I can totally see why this error happened. It was the equivalent of the engineering student blithely applying any equation where the units work out. Uncompressed pixels go in, compressed data comes out. Compression algorithms are substitutable... except when they're not.

JBIG2 compression is in no way a standard compression algorithm, as the standard only describes decompression. The compression depends on the implementation. And this is where incompetence comes back into the game.

Ouch, imagine this happens in a hospital with a prescription or something. It could really have some serious implications.

Indeed, I keep a copy of my lab results for the last N years because they sometimes get lost, once through no real fault of the doctor (http://en.wikipedia.org/wiki/2011_Joplin_tornado).

Grrr, I'm now going to have to view every lab report that's not an original with suspicion, and make sure my doctors aren't making recommendations due to screwed up copies.

Lossy compression is not an acceptable default for a general purpose device.

This isn't even lossy compression - it's misleading compression

It's one of the worst examples of "seamless design"[1] I have ever seen.

[1] http://jim-mcbeath.blogspot.co.uk/2008/11/seamful-design.htm...

Cached copy, which is missing the updated content: http://webcache.googleusercontent.com/search?q=cache%3Awww.d...

Thanks. As of now, the cache seems to contain the update. It begins with:

Edit: In the last section, it is now sketched what the reasons for the issue may be, on the basis of several emails I got.

I bet the cached copy is subtly different. Fool me once...

If I were a sentient network and wanted to cause panic among the humans, as a prelude to full-blown warfare, this is how I'd start. Let's send all those Xerox copiers to Guantanamo, they are obviously terrorists.

My first thought was, "I wonder if this has anything to do with copy protections related to anti counterfeiting?"

Not that I have any valid reasons to consider this.

Agreed. I worried that this might be yet another example of printers and scanners doing strange things, like <https://en.wikipedia.org/wiki/Printer_steganography> or <https://en.wikipedia.org/wiki/EURion_constellation>. Glad to see that this can be ascribed to incompetence rather than malice.

Given the challenges of JBIG2 it seems one should be able to construct a 'test' page which, when scanned, will test the algorithm's accuracy.

Once you have that, you can turn it into a sales tool for folks selling multi-function printers, such that there are "good" printers and "bad" printers, and then everyone will be forced to pass the test or be labeled a 'bad' printer.
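A minimal sketch of what such a test page could look like, assuming the scanned output is run through some OCR step afterwards (not shown here, and itself a possible source of errors). The function names `make_test_rows` and `find_substitutions` and the fixed seed are made up for illustration; the point is only that a seeded generator gives a reproducible ground truth to compare against.

```python
import random


def make_test_rows(rows=40, cols=60, seed=2013, alphabet="0123456789"):
    """Generate reproducible rows of random digits to print as a test page."""
    rng = random.Random(seed)  # fixed seed: the same page can be regenerated later
    return ["".join(rng.choice(alphabet) for _ in range(cols)) for _ in range(rows)]


def find_substitutions(expected, recognized):
    """Return (row, col, expected_char, got_char) for every mismatching character.

    `recognized` is assumed to be the text recovered from the scanned page,
    with the same row/column layout as `expected`.
    """
    errors = []
    for r, (exp_row, got_row) in enumerate(zip(expected, recognized)):
        for c, (exp_ch, got_ch) in enumerate(zip(exp_row, got_row)):
            if exp_ch != got_ch:
                errors.append((r, c, exp_ch, got_ch))
    return errors
```

Printing the rows in a small font at a low scan resolution would presumably stress the symbol matcher the most, since that is where digits such as 6 and 8 look most alike.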

Wow, how terrifically and fundamentally negligent. Let's hope nobody dies — the potential hazards seem almost endless.

Hmm, I use one of these to create PDFs of receipts to attach to my expense reports.

That's one hell of an error. It is literally better for these machines never to have existed at all.

Just an update: the author states on Twitter that he already had notified Xerox a week ago [1]. Apparently, Xerox has only now contacted him because they thought it was a joke [2] ...

[1] https://twitter.com/davidkriesel/status/364345036407709697

[2] https://twitter.com/davidkriesel/status/364329334300880896

Reminded me of this:

"Digital Photocopiers Loaded With Secrets" http://www.youtube.com/watch?v=Wa0akU8bsOQ

Don't all these kinds of machines have a scrub disk option? Or just take the disk out and scrub it.

Possibly. The focus of the story was that (at least at the time) many of the owners/leasers of these machines had no idea they contained drives that retained the scanned documents.

Now that's a bug I wouldn't like being responsible for

I don't think the programmer who coded it is to blame. The manager who (very likely) cut the QA needed to find it, to save a few bucks, is.

This is a massive error, on the order of Intel's FDIV bug.

Wow. I cannot imagine how much chaos this could cause.

Do we know the scope of likely affected printers? The company I work at runs a whole heap (~80) of WorkCentre 3220, 4150 and 4250s, as well as ApeosPorts, etc.

I shudder to think how much we've scanned that could be affected by this. Thankfully, I think all of our engineering drawings (which for a decade+ were printed, signed, then scanned when needed for digital issue) were done on a non-Xerox device, but all of our standard A3/A4 business stuff is done on Xerox devices.

Minor correction: The article says that the JBIG2 patch size might be the size of the scanned text. JBIG2 actually has the capability to detect regions of text and compress them using a specialized technique that operates on individual symbols.

I suspect Xerox is using this option and their implementation is getting confused (perhaps by the low resolution). Unless I'm greatly mistaken, the patch size for normal compression shouldn't figure here.

I was confused by that as well. From what I understand of how JBIG2 works, those symbols don't even have to have the same size everywhere (as would be quite common with proportional fonts anyway). So there is no "patch size" per se; just the low resolution confusing the classifier.

I doubt the patch size is even configurable, as identified patterns can be scaled accordingly. However the author is not to blame, because JBIG2 is poorly documented and the implementation of the compressor is not specified in the standard.
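To illustrate how a symbol classifier can get confused at low resolution, here is a toy sketch of lossy pattern matching. It is not Xerox's implementation and not taken from any JBIG2 encoder: the 3x5 glyphs, the Hamming-distance match, and the `max_mismatch` threshold are all assumptions made for the example. The decoder just pastes dictionary entries back, so whatever the encoder merged stays merged.

```python
# Toy 3x5 bitmaps; at this resolution a 6 and an 8 differ by a single pixel.
SIX = (
    "###",
    "#..",
    "###",
    "#.#",
    "###",
)
EIGHT = (
    "###",
    "#.#",
    "###",
    "#.#",
    "###",
)


def hamming(a, b):
    """Count differing pixels between two glyph bitmaps of the same size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode_symbols(glyphs, max_mismatch):
    """Greedy matching: reuse the first dictionary entry within max_mismatch pixels."""
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if hamming(glyph, known) <= max_mismatch:
                indices.append(i)  # close enough: reuse the stored symbol
                break
        else:
            dictionary.append(glyph)  # no match: store a new symbol
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode_symbols(dictionary, indices):
    """Rebuild the glyph sequence purely from the dictionary and the indices."""
    return [dictionary[i] for i in indices]
```

With `max_mismatch=0` the sequence 6, 8, 6 round-trips exactly; with `max_mismatch=1` the 8 is matched to the stored 6 and comes back as a clean, sharp 6, which is exactly why this kind of error is so hard to spot by eye.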

Now to the important question: How can I easily assert that my scanner, or the next scanner I buy, does not have the same issue?

Scan in black & white, scan in grayscale, reduce grayscale to bi-level and compare.
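A rough sketch of that comparison, assuming both scans are already loaded as same-sized, aligned grids of pixels (lists of rows; grayscale values 0-255, bi-level values 0/1 with 1 meaning black). The function names, the block size, and the thresholds are arbitrary choices for illustration. Real scans would also need alignment and some tolerance for noise, which this ignores.

```python
def to_bilevel(gray, threshold=128):
    """Reduce a grayscale image to bi-level: 1 = black, 0 = white."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def suspicious_blocks(bw_scan, gray_scan, block=8, max_diff=0.25, threshold=128):
    """Return (top, left) of blocks where the two scans disagree on too many pixels.

    A few stray pixels per block are expected from noise; a block where a large
    fraction differs hints that a whole glyph may have been swapped.
    """
    reference = to_bilevel(gray_scan, threshold)
    height, width = len(reference), len(reference[0])
    if len(bw_scan) != height or any(len(row) != width for row in bw_scan):
        raise ValueError("scans must have identical dimensions")
    flagged = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            differing = sum(bw_scan[y][x] != reference[y][x] for y in rows for x in cols)
            if differing > max_diff * len(rows) * len(cols):
                flagged.append((top, left))
    return flagged
```

An empty result does not prove the scanner is safe; it only means this particular page did not trigger a visible substitution.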

Your honour, my computer was hacked. Oh, you don't believe that? Well then, I used a Xerox copier!

That's what you get when you use lossy compression, and it's hardly a problem unique to Xerox scanners. Maybe important documents should be scanned to a higher resolution so you don't have problems like this.

Could you share with us a list of other scanners that have this problem, so we can avoid them?

I tried various compression and density settings on my Fujitsu scanner, and didn't see any problems like those mentioned in the article.

Thanks. I scan a ton of documents with my Fujitsu scanner, so that's particularly relevant to me.

I doubt Xerox would be arrogant enough to try shifting responsibility to the victim like that. Do you reverse engineer every product you use just to confirm that the designer didn't cut corners to make it work differently from every other example of a familiar class of product?

It's particularly absurd in this case since it's clearly not easy to learn that lossy compression is being applied or how one would disable it if they wanted their Xerox to work like every other copier/fax they've used.

TIFF with lossless compression all the way, followed up with OCR if necessary.

Has anyone recreated this issue? I haven't been able to.

On a Xerox 7535 I was able to recreate the problem using the example sheet of numbers provided in the TIFF image.

Ack! Looks like overclever compression in a domain where it's not always desired, let alone required.

I spent half a decade on document imaging in the early to mid '90s, a fair amount close to this level (had a coworker who loved bit level stuff for the truly evil problems like this), and I can see how it happened ... given sufficiently careless developers.
