This class of error is called (by me, at least) a "contoot" because, long ago, when I was writing the JBIG2 compressor for Google Books PDFs, the first example was on the contents page of a book. The title, "Contents", was set in very heavy type, which happened to be an unexpected edge case in the classifier: it matched the "o" with the "e" and "n" and output "Contoots".
The classifier was adjusted and these errors mostly went away. It certainly seems that Xerox have configured things incorrectly here.
Also, with Google Books, we held the hi-res original images. It's not like the PDF downloads were copies of record. We could also tweak the classification and regenerate all the PDFs from the originals.
For a scanner, I don't think that symbol compression should be used at all for this reason. For a single page, JBIG2 generic region encoding is generally just as good as symbol compression.
How would one handle the case with the tiny boxes? It seems to me that these ought to be treated more like line drawings and not unify them as symbols at all if you can't properly decompose them into lines of Latin alphabet glyphs. JBIG2 of course cleverly doesn't tell you how to do the "smart" segmentation...
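One way to avoid unifying line drawings and tiny boxes as symbols is to refuse symbol coding for any region that doesn't segment cleanly into glyph-sized components. The sketch below is pure illustration, not anything from a real encoder: JBIG2 deliberately leaves segmentation to the implementer, and the heuristic, function name, and thresholds here are all made up.

```python
# Toy policy: only symbol-code regions whose connected components look like
# glyphs; everything else (line art, tiny boxes, rules) gets generic region
# coding, which is lossless for bi-level data.
def choose_coding(components, min_h=8, max_h=64):
    """`components` is a list of (width, height) bounding boxes found in a region."""
    glyph_like = [c for c in components if min_h <= c[1] <= max_h]
    if components and len(glyph_like) / len(components) > 0.95:
        return "symbol"   # almost everything is glyph-sized: likely a text line
    return "generic"      # don't risk substituting shapes that aren't text

print(choose_coding([(10, 12), (9, 11), (11, 12)]))  # symbol
print(choose_coding([(3, 3), (4, 3), (200, 2)]))     # generic
```

The conservative default matters: misclassifying text as line art merely costs a few bytes, while misclassifying line art as text risks silent substitution.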
Actually, that doesn't matter all that much. You ought to scan it into a TIFF file and then process it the way you want. If you want a JBIG2 compressor that's good by your standards, you have to write it yourself anyway; I don't think the printer hardware and software are up to that task.
The idea is actually very smart: given the infinite (and multidimensional) space of encoder solutions, fixing the bit encoding and the decompression process was the right call. It's like with PDF: it's well defined how to render it into a bitmap, but you're not constrained in how you generate the layout, which line-break algorithm you use, etc.
> The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".
Wouldn't it be a good idea to perform OCR - using a language model, the works - before you start classifying the JBIG2 symbols? That way, you'd have additional contextual information to say "Aha, 'contoots' is probably not what it reads here" at least in some of the cases.
Although, I realize that on "Google scale", such a complex solution could be a problem.
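For what it's worth, a cheap version of that sanity check doesn't even need a full language model: a word list plus a fuzzy match already flags the canonical example. This is a sketch only; the dictionary, function name, and cutoff are invented for illustration.

```python
# Hypothetical post-OCR sanity check: if a recognized "word" isn't in the
# dictionary but is suspiciously close to a known word, flag it as a possible
# symbol-classifier substitution worth re-examining.
import difflib

DICTIONARY = ["contents", "chapter", "index", "preface"]

def suspicious(word, cutoff=0.7):
    """Return the likely intended word, or None if nothing looks off."""
    w = word.lower()
    if w in DICTIONARY:
        return None
    close = difflib.get_close_matches(w, DICTIONARY, n=1, cutoff=cutoff)
    return close[0] if close else None

print(suspicious("contoots"))  # flags "contents"
print(suspicious("Contents"))  # None: a known word, nothing to do
```

Numbers, of course, defeat this entirely, which is exactly the problem discussed elsewhere in this thread: "900" and "200" are both perfectly plausible.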
This was predictable. JBIG2 is in no way secure for document processing, archiving, or anything else. The image is sliced into small areas and a probabilistic matcher finds other areas that are similar. This way similar areas only have to be stored once.
Yeah right, you get it, don't you? They are similar, not equal. Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0.
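The failure mode described above can be shown in a few lines. This is a toy, nothing like a real JBIG2 encoder: the bitmaps, threshold, and matcher are all made up, but the mechanism (reuse a stored glyph whenever a new one is "similar enough") is the same.

```python
# Toy symbol coder: substitute a dictionary glyph for any new glyph whose
# pixel similarity exceeds a threshold. Similar, not equal.
def similarity(a, b):
    same = sum(p == q for p, q in zip(a, b))
    return same / len(a)

# 3x5 bitmaps flattened to 15 pixels; this '6' and '8' differ by one pixel.
GLYPH_8 = [1,1,1, 1,0,1, 1,1,1, 1,0,1, 1,1,1]
GLYPH_6 = [1,1,1, 1,0,0, 1,1,1, 1,0,1, 1,1,1]

dictionary = {"8": GLYPH_8}  # symbols already stored

def encode(glyph, threshold=0.9):
    """Return the dictionary symbol used in place of `glyph`, storing it if new."""
    for name, stored in dictionary.items():
        if similarity(glyph, stored) >= threshold:
            return name  # lossy: the decoder will draw `stored`, not `glyph`
    dictionary[str(len(dictionary))] = glyph
    return str(len(dictionary) - 1)

print(encode(GLYPH_6))  # at threshold 0.9 the '6' is silently rendered as '8'
```

At scanning resolutions where two digits really do differ by a pixel or two, no threshold short of exact equality is safe.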
I wonder which prize idiot had the idea of using this algorithm in a copier. JBIG2 can only be used where mistakes won't mean the end of the world. A photocopier is expected to copy. If these machines were used for digital document archiving, some companies will face a lot of trouble when the next tax audit comes due.
Digital archives using this kind of lossy compression are not only worthless, they are dangerous. As the paper trail is usually shredded after successful redundant storage of the images, there will be no way of determining correctness of archived data.
This will make lawsuits a lot of fun in the future.
> This will make lawsuits a lot of fun in the future.
Given the way the algorithm works, it would seem to me that "fine print" would be the most vulnerable to the bug (well not really a bug, it's the behavior of JBIG2). I wonder if there will be a clear dividing line, e.g. "smaller than 10pt type is subject to reasonable doubt if a Xerox copier was used"
The trouble is, nothing is beyond reasonable doubt anymore. Copying and digital archiving both rely on the premise that there is no manipulation. Lossy compression always seemed OK because image quality was reduced without changing the integrity and structure of the image. This will essentially destroy the credibility of digital records. Every shyster and hack lawyer will pull this as defense in court.
Also it's not like there is a reference implementation for encoding JBIG2 everyone uses. We're talking about proprietary libraries which do the compression. These libraries are compared using performance indicators like speed, memory usage, etc. This gives sloppy crap implementations an advantage, because (and I'd bet on that) when the implementation was chosen, the deciders didn't even have the idea that a compression could actually manipulate the document content. Automated testing of compression algorithms is hard, because by design there can never be 100% proof, as the output image is different from the input image. If the comparison is broken, the test will fail to identify errors.
The critical failure in the design was thinking that some sort of algorithm performs equally or better than the human brain at recognizing text in low quality. This is - up to now - not the case.
Text won't be a big issue, as mistakes are kinda easy to spot. It's also less probable to have image fragments that seem similar but really aren't. The algorithm isn't really smart; it's mostly just pattern matching, due to performance constraints. Thanks to kerning (variation in the distance between individual characters), I doubt that swapping of words or sentences will occur a lot, unless the threshold for reaching significance in the comparison algorithm is higher than the guy who designed it was.
The real trouble starts when looking at numbers. Numbers are usually typeset monospaced, right-aligned and/or in tables. The possible visual variations are pretty low, yet each digit carries 10 different possible meanings. Text documents are usually scanned at a pretty low resolution, because a human can still distinguish between characters and numbers even when a lot of information is lost. As already mentioned, algorithms cannot.
The next problem: we can spot mistakes in text because there are syntactic and semantic rules which we more or less understand. While reading, our subconscious validates the input, and obvious errors pop out. When it comes to numbers, there is no such thing. A number cannot be validated without additional knowledge. And as document processing is a labour-intensive task, mostly executed by minimum-wage clerks, there is no way in hell a mistake would be spotted before the documents are archived for all eternity.
Let's put on the tinfoil hat for a moment:
If someone wanted to really fuck up a company, they could just flash the printer/scanner/copier firmware, changing parameters of the compression.
> Every shyster and hack lawyer will pull this as defense in court.
I don't think it will be so trivial to use this defense. As somebody claimed, JBIG2 _reuses_ sufficiently similar blocks, so I guess it can be relatively easily determined whether the document has been messed up by lossy compression.
"The image is sliced into small areas and a probabilistic matcher finds other areas that are similar."
"Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0."
If that alone is the reason why JBIG2 is in no way secure for document processing, archiving, or anything else, then I've got some bad news for you: in that case you really shouldn't be using a computer for, well, anything.
A very tech-savvy friend bought a camera in Japan and after about a week or so started delving into the settings. He thought all the faces looked wrong. He found a setting that made the eyes bigger and rounder. It was subtle, but quite funny at the same time.
Most consumer compact cameras have a "purikura" setting or a "beauty" setting with special treatment for the skin, whiter eyes and whiter teeth, and possibly bigger eyes and a smaller mouth (yes, that's a thing).
It may be on by default for cameras targeted at a female audience (in a rapidly shrinking market, female bloggers, for instance, are a big target); otherwise it won't even be available in the more specialized or "hardcore" markets, like DSLRs or mirrorless (4/3rds, Nikon 1, EOS M, etc.). As an anecdote, I bought a shockproof/waterproof compact camera last year and there's nothing so fancy on it.
With personal video recording (a la Google Glass and friends) it won't be long before we're subjected to this sort of thing. It's amazing how close we're getting to Ghost in the Shell and I'm sure it won't be long when live video feeds can be hacked in real time to show something contrary to what's actually happening.
So it sounds like there's one code path and it's seriously broken. I looked at the first settings page, and while it's in German I can see it's 200 DPI. There's no excuse for defaulting to lossy compression when you're at 200 DPI on office-sized paper. We didn't do that in 1991: we got CCITT Group 4 lossless compression at around 50 KB per image, plus or (more often) minus, for 8.5x11 inch paper, although we did do things like noise reduction and straightening documents (which makes them compress better, among other things).
CCITT Group 4, also known as Modified Modified READ (MMR), is in no way something you'd want to use now, ever. Here is why MMR sucks hard:
1. It's monochrome. No greyscale, no color. This works for text and lines, but nothing else. No big surprise, it was designed for fax. But it makes CCITT G3 and G4 lossy for anything that isn't bi-level to begin with.
2. It has no defined endianness. This adds another fault risk which you won't see coming as long as you're working on an isolated platform, but which can hit you in the nuts when you change hardware or software.
3. The data contains no resolution or dimensional information, and no byte-order information either. This means you have to rely on a container to provide these details. It could be TIFF, it could be PDF, it could be something an intern coded during a coffee break. This is good on one hand, but evil on the other: software is sold as using CCITT G4 compression (a standard, after all), while the data is embedded in proprietary containers.
4. It's a 2D compression, meaning it is applied to a matrix of binary pixel data. As the standard does not specify the dimensions, you depend on an image container like TIFF to provide that information. Because G4 removed the EOL markers, there is no way to reconstruct the image dimensions from the compressed data alone.
5. It's not exactly fault tolerant. Transmission errors can affect large areas of the image, up to making the picture totally unreadable. Flipped bits are not too critical; missing bits are, due to the 2D compression.
There are many excellent, fault-tolerant, standardized image formats ready to use for document processing and archiving; CCITT G4 isn't exactly one of them.
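To make point 3 concrete, here is a minimal sketch (stdlib `struct` only; a toy, not a spec-complete TIFF writer, and it deliberately ignores details like left-justified SHORT value packing) showing that width, height, and byte order live in the container's tags, never in the G4 payload:

```python
# The G4 bitstream carries no width/height/byte-order; the TIFF container does.
# Build a minimal IFD with just those tags, then read them back.
import struct

def make_tiff_header(width, height, little_endian=True):
    e = "<" if little_endian else ">"
    magic = b"II" if little_endian else b"MM"
    header = magic + struct.pack(e + "HI", 42, 8)   # magic number, first IFD offset
    entries = [
        (256, 3, 1, width),    # ImageWidth
        (257, 3, 1, height),   # ImageLength
        (259, 3, 1, 4),        # Compression = 4 (CCITT T.6 / Group 4)
    ]
    ifd = struct.pack(e + "H", len(entries))
    for tag, typ, count, value in entries:
        ifd += struct.pack(e + "HHII", tag, typ, count, value)
    ifd += struct.pack(e + "I", 0)                  # no next IFD
    return header + ifd

def read_tags(buf):
    e = "<" if buf[:2] == b"II" else ">"            # byte order comes from the container
    (offset,) = struct.unpack_from(e + "I", buf, 4)
    (n,) = struct.unpack_from(e + "H", buf, offset)
    tags = {}
    for i in range(n):
        tag, typ, count, value = struct.unpack_from(e + "HHII", buf, offset + 2 + i * 12)
        tags[tag] = value
    return tags

tags = read_tags(make_tiff_header(1728, 2200))
print(tags[256], tags[257], tags[259])  # 1728 2200 4
```

Hand a decoder the G4 payload without these tags and it has no way to know where one scan line ends and the next begins.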
What I meant to say was that CCITT Group IV gave acceptable sizes for early-'90s computing power, CPU and disk, and something at or better than its level of lossless compression today should be even more acceptable.
And in light of this screwup, I suspect we'd agree that Xerox would have been better off to use lossless (well, after the scanning, as you point out, but then again no one was willing to pay for color) CCITT Group IV than overly clever lossy JBIG2.
"It could be TIFF, it could be PDF, it could be something an intern coded during coffee break."
It could be something a journeyman software engineer coded on a Saturday afternoon in a very fast-paced project; for me, 3 weeks on the "engine". And, oh my, I can't remember encoding endianness, except of course for the leading TIFF bytes. But I had a guy who knew this cold telling me what to do; he was the one who debugged all our raw compressed data problems bit by bit. And, yeah, it was an "Intel" little-endian TIFF, and I think I recall the Kodak Powescans produced that (600 pound monsters that could scan 18 inches per second at 200 DPI).
Hmmm, at least back then, "TIFF" was the selling point, and, oh yeah, it's Group IV compressed (except of course when it wasn't, we once dealt with some weird enhanced Group III).
Of course they would have been better off with T.6 (Group 4), as it at least does not modify the image content. However, especially with TIFF, there are/were countless implementations of viewers, components, and libraries, and every single one had its own habits. Some disregarded endianness, some assumed the payload endianness matched the TIFF's, and some did respect the byte-order tag specific to the image. When I wrote my first TIFF library I was around 14, and the most troublesome part was keeping myself from bashing my head against the nearest wall over people who thought interpreting a standard according to their wishes was OK, because there'd never be someone trying to display the images with a viewer different from theirs.
I don't know how deeply you have dived into TIFF, but maybe you remember the TIFF 6.0 standard's way of embedding JPEG. It was the biggest pain in the ass imaginable: having to parse JPEG files, split them up, and package the pieces into different TIFF tags. Before TTN2 and easy embedding of JPEG images, everyone invented their own way of avoiding the standard. Some defined their own compression type; some used the standard compression type in a nonstandard way. Ah, I'm starting to lose my hair again ;-)
Not that deeply, I only did B&W document imaging, and I think the last time I worked on TIFF headers and tags was in 1992, so it was almost certainly the 5.0 standard, 6.0 came out in that year.
And yeah, it was a mess; we mostly did the best we could and made sure the ones we generated worked for our customer's reader(s). Although I don't remember any big problems with people reading the ones we produced.
In the good old days of analog copiers this would be impossible: the scanner sends the light through a system of mirrors to the drum, the drum gets statically charged, the toner is pulled onto the charged parts and transferred to the transfer belt, where the paper, carrying the opposite charge, pulls the toner off the belt; the paper then goes through the fusing unit, where the toner is 'burned' onto it. End of story.
On a modern copier the scanner transfers the data first to RAM and then usually to a hard disk (most people do not even know that the "copy machine" has one and saves the scanned material to it).
From that hard disk the data is transmitted via laser to the drum.
Tadaaa: there's your reason for data being compressed on a modern copier.
Yup, and those old analog copiers - good ones at least - had beautiful crisp output. The resolution was good enough to reproduce printing dots so they could even duplicate photos from books. Continuous tone of an analog photograph didn't work as well. They sure were expensive though.
Others have pointed out a credible explanation: to have the document take less space on their hard disk.
However, it does not have to be compression, per se. Modern copiers want to correct all kinds of errors such as creases and staples. They also want to optimize the colors. To do that, they have logic for detecting what areas of the page are full-color and which are black and white, which are half-tone printed, which are text, line art, photograph, whether the paper might have aged, etc.
I don't know what tricks they use, but I do not rule out that they will replace 'looks somewhat dirty' patches with an 'obviously higher quality version' of them, and use too aggressive parameters in some of those heuristics.
IIRC, when security cameras moved from per-frame compression algorithms like M-JPEG to modern codecs, which can sometimes replace small movements in the background with a still image when there is a bigger change in the foreground, there were news reports about problems with investigations.
If you're scanning a long document to a PDF, compression makes a lot of sense. It's the difference between being able to email the PDF as an attachment and having to find a place to put the file online.
Exactly, and that is why there should be a compression step on the code path that handles the paper -> pdf case. This doesn't make any sense in a paper -> paper case, however, as any electronic version of the image will only be stored internally, for a very brief time.
Normal people seem to get that it's considerably harder to ship a barn than a letter, and that if you want to move a barn you use a specialty service rather than the post office.
The size and number of files are and should be totally relevant even to "normal" people. When someone asks for something in e-mail, it's perfectly reasonable to say "no, it's much too big" and expect them to understand.
But we're not dealing with barns or letters or any physical object, we're dealing with abstract systems where the physics are much more flexible and changeable. It's important to change our computer systems to work for us, rather than attempting to change people to adapt to the computer systems. We should discard those systems that can not adapt to humanity, as they are of little worth in the long run.
When somebody says, "can you email it to me" they mean, grant access to the data via their centralized messaging system, their email. There are many ways to make that happen, one of which is an attachment, another of which is linking to the content, but the key is to make sure that it's low friction and takes very little time or clicks to get access from the email.
There are some pieces of content for which it's entirely impractical to "grant access via e-mail". On occasion people ask to be e-mailed extremely large blocks of data, where it would literally be faster to burn it to a pile of DVDs and then FedEx them than to upload-and-then-download the data. Depending on the size of the medical images mentioned in a previous post, that might actually be the case in that circumstance.
It's a failure of technology when it's difficult to send ordinary-sized files like a few photos or a couple pages of documents. But it's a failure of people when they don't recognize the possibility that some types of data (video, large numbers of images, scientific research data, whole databases) simply can't be sent quickly, yet they fail to plan ahead to gain access. (I've also entirely skirted the issue of "some data should have its access restricted physically"...)
> But it's a failure of people when they don't recognize the possibility that some types of data (video, large numbers of images, scientific research data, whole databases) simply can't be sent quickly, yet they fail to plan ahead to gain access. (I've also entirely skirted the issue of "some data should have its access restricted physically"...)
I disagree vehemently with that attitude, and I have to deal with it everyday. In my field, >50% of the data we receive is transferred by overnight courier of hard drives due to quantity of data. It's a crappy attitude to blame people for having to learn that, and in an ideal world we'd share it via access granted by email. People should not be blamed for not understanding that, our infrastructure should be blamed for not supporting 10Gb everywhere, and cheap access to 40Gb+ on long-distance connections.
Nothing is helped by blaming people, and relationships can be harmed by doing that. But we can change the technology.
As a side note, DVDs? Really? They're incredibly slow at data transfer once you have them in hand, the tiny size of a DVD requires tricky archive spanning methods, and optical discs are flaky technology all around. Hard drives or LTO-5/6 all the way.
I think most people would understand if you asked them how long it would take to download a million large photos, given that one large photo often takes several seconds to complete. They'd realize that this might be a slow process.
Incidentally, it's not just about transfer speed. Sometimes people ask if you can e-mail something that has never been put on a computer, and would take weeks or months to scan in. Or sometimes they ask for access to information when access is very slow to set up due to security or privacy considerations. Or sometimes they ask for access to something that the boss needs to physically sign off on, after the boss has gone home for the day. This is only a problem if they've decided it's urgent to have it, and simply haven't thought ahead about how it might not necessarily be possible to get instant access to every piece of information that ever existed.
We can change technology. But we also need to retain the mindset of arranging access beforehand. It's not about "blaming", it's simply about encouraging people to understand what they're asking for and to make sure they get the access they need before they need it.
[As an aside, DVDs are just an example of "sometimes it's really freaking slow to download data" that somebody like my mom would get. An alternative way to phrase it would be "downloading that would be so slow, it'd be better to just have your friend bring her laptop over." I certainly don't intend to suggest a new industry standard.]
I think we can all agree that e-mails should have finite size - it's not a very good protocol for transferring multi-gigabyte files, for sure! Where we would disagree is where that limit should be drawn.
I've seen systems in this day and age that fail in the face of e-mails as small as 5 megabytes (e.g. Yahoo Popgate) which IMHO is far too low - but evidently some sysadmins disagree with me!
Email size is a technical issue that shouldn't be limiting (or even visible) to the end user. If an end user wants to send a multi-gigabyte file to another user's email address, why not? The email client could launch a background upload process and email a link to fetch that file by, say, BitTorrent... Some protocol extensions and software support would be needed, but that can be done and, as users need it, probably should be done.
Just to make it clear, I'm not an engineer (I'd like to be good enough to be considered one, but that's a long way off). I'm more of an enthusiastic amateur who knows enough to badly break things. And the people who most frequently ask are ortho surgeons with a patient asleep on the table. Anyone who waits until that late in the piece and then asks for a 1.5-gig email, because they weren't organised enough to sort out access to images in the 3-month lead-up to an operation, is not a normal person.
Ack! Looks like overclever compression in a domain where it's not always desired, let alone required.
I spent half a decade on document imaging in the early to mid '90s, a fair amount close to this level (had a coworker who loved bit level stuff for the truly evil problems like this), and I can see how it happened ... given sufficiently careless developers.
Geeze. This could result in some catastrophic errors. An order for 900 servers instead of 200. $7M loss instead of $1M in your quarterly earnings. Pricing your product at $3 instead of $8. Makes you realize you need some redundancy and double-checks for important communications.
I don't think it's necessarily an issue of inexcusable incompetence: it seems like one of those faults which is obvious in retrospect but very difficult to predict. Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers? That would seem to be a safer choice than writing a lossy compression algorithm from scratch. QA testing probably was on the order of 'picture looks right'; after all, why bother testing that the semantics of the copied content match the original when what you're building is a bitmap duplicator? (Of course, the OCR stuff would be tested more rigorously, but this explicitly bypasses that piece). It's not hard to see the chain of individually reasonable decisions that could lead to something like this.
The real failure is probably something more cultural: there was nobody with the discipline, experience, and power to write an engineering policy prohibiting the use of lossy compression in duplication equipment. I have no idea about Xerox's corporate history, but the evisceration of engineering departments in US giants and the concomitant decline in what one might call 'standards' or 'rigor' is an established concept.
> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?
I have never heard of JBIG2. I have implemented JPEG2000 codecs from scratch, including arithmetic coding compression, and I had never heard of JBIG2. And here they are using it, with others claiming it is just a standard, run-of-the-mill thing.
> That would seem to be a safer choice than writing a lossy compression algorithm from scratch.
Going out on a limb here, but wouldn't the safest option be to not use a lossy codec at all, or to use something like JPEG?
> QA testing probably was on the order of 'picture looks right';
Sorry. This is the company whose name is the equivalent to the verb "to copy". If plugging in an obscure codec from some place and checking if one picture looks "OK" is their idea of QA then they deserve all the ridicule and lawsuits stemming from this.
JBIG2 is hardly obscure. It is billed just as prominently on the official JPEG site as JPEG and JPEG2000.
It is useless to someone that wants to compress arbitrary images, since it is bi-level only, I'd ignore it too if I wanted to compress a photograph. Not having an open specification hurts. The "last draft" is available, but the final was sacrificed to someone's business model.
> Sorry. This is the company whose name is the equivalent to the verb "to copy". If plugging in an obscure codec from some place and checking if one picture looks "OK" is their idea of QA then they deserve all the ridicule and lawsuits stemming from this.
You need to put your corporate drone hat on. How many people are involved in making a Xerox copier? How many parts are reused from the previous model? How much software is reused?
My best guess is that a large number of components in a copier are engineered in isolation. The image compression people responsible for implementing JBIG2 probably don't even care about correctness beyond some threshold ("not my problem"). The people responsible for ensuring correct copying may not even know that an image compression exists, and even if they do, may not understand the technical nuances of JBIG2, and also may not even have the right documents to find an instance of such a problem.
> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?
The problem isn't using a standard compression algorithm. It's failing to consider the properties of the algorithm used in relation to the problem domain.
A classic mistake engineering students make is to try and use familiar equations anywhere that the units work out. As a result, engineering professors hammer in the idea that before using any equation, you have to ask yourself: what are the assumptions underlying this equation, and do those assumptions hold for my specific problem? Similarly, if you're writing software for copiers, you should ask the basic question of whether a particular compression algorithm was appropriate for the particular types of images being compressed. It's incredibly basic.
I can totally see why this error happened. It was the equivalent of the engineering student blithely applying any equation where the units work out. Uncompressed pixels go in, compressed data comes out. Compression algorithms are substitutable... except when they're not.
JBIG2 compression is in no way a standard compression algorithm, as the standard only describes decompression. The compression depends on the implementation. And this is where incompetence comes back into the game.
If I were a sentient network and wanted to cause panic among the humans, as a prelude to full-blown warfare, this is how I'd start. Let's send all those Xerox copiers to Guantanamo, they are obviously terrorists.
Given the challenges of JBIG2 it seems one should be able to construct a 'test' page which, when scanned, will test the algorithm's accuracy.
Once you have that, you can turn it into a sales tool for folks selling multi-function printers, such that there are "good" printers and "bad" printers, and then everyone will be forced to pass the test or be labeled a 'bad' printer.
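The scoring half of such a test page could be as simple as diffing the OCR'd text of the scan against the known ground truth. This sketch is hypothetical: the ground-truth string of confusable glyphs is a placeholder, and a real test would obviously run OCR on the scanned page first.

```python
# Compare the text recovered from a scanned test page against what was printed.
# Any similarity below 1.0 on a page of confusable glyphs means the device
# substituted symbols somewhere.
import difflib

GROUND_TRUTH = "0123456789 6868 171 lIl O0 B8 S5"   # confusable glyph sets

def score(ocr_text):
    return difflib.SequenceMatcher(None, GROUND_TRUTH, ocr_text).ratio()

print(score("0123456789 6868 171 lIl O0 B8 S5") == 1.0)  # True: faithful copy
print(score("0123456789 8888 171 lIl O0 B8 S5") < 1.0)   # True: 6->8 substitutions
```

A pass/fail threshold of exactly 1.0 is the right bar here: on a copier, any symbol substitution at all should fail the test.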
Do we know the scope of likely affected printers? The company I work at runs a whole heap (~80) of WorkCentre 3220, 4150 and 4250s, as well as ApeosPorts, etc.
I shudder to think how much we've scanned that could be affected by this. Thankfully, I think all of our engineering drawings (which for a decade+ were printed, signed, then scanned when needed for digital issue) were done on a non-xerox device, but all of our standard A3/A4 business stuff is done on Xerox devices.
Minor correction: The article says that the JBIG2 patch size might be the size of the scanned text. JBIG2 actually has the capability to detect regions of text and compress them using a specialized technique that operates on individual symbols.
I suspect Xerox is using this option and their implementation is getting confused (perhaps by the low resolution). Unless I'm greatly mistaken, the patch size for normal compression shouldn't figure here.
I was confused by that as well. From what I understood of how JBIG2 works, those symbols don't even have to be the same size everywhere (as would be quite common with proportional fonts anyway). So there is no "patch size" per se; just the low resolution confusing the classifier.
I doubt the patch size is even configurable, as identified patterns can be scaled accordingly. However the author is not to blame, because JBIG2 is poorly documented and the implementation of the compressor is not specified in the standard.
That's what you get when you use lossy compression, and it's hardly a problem unique to Xerox scanners. Maybe important documents should be scanned to a higher resolution so you don't have problems like this.
I doubt Xerox would be arrogant enough to try shifting responsibility to the victim like that. Do you reverse engineer every product you use just to confirm that the designer didn't cut corners to make it work differently from every other example of a familiar class of product?
It's particularly absurd in this case, since it's clearly not easy to learn that lossy compression is being applied, or how to disable it if you want your Xerox to work like every other copier/fax you've used.