The classifier was adjusted and these errors mostly went away. It certainly seems that Xerox have configured things incorrectly here.
Also, with Google Books, we held the hi-res original images. It's not like the PDF downloads were copies of record. We could also tweak the classification and regenerate all the PDFs from the originals.
For a scanner, I don't think that symbol compression should be used at all for this reason. For a single page, JBIG2 generic region encoding is generally just as good as symbol compression.
More than you want to know about this topic can be found here: https://www.imperialviolet.org/binary/google-books-pdf.pdf
The idea is actually very smart: given the infinite (and multidimensional) space of encoder solutions, fixing only the bit encoding and the decompression process was the right call. It's like PDF: how to draw it into a bitmap is well defined, but you're not constrained in how you generate the layout, which line-break algorithm you use, etc.
The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".
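To make the failure mode concrete, here is a toy sketch of symbol matching with a similarity threshold. It is not the real JBIG2 encoder or the Google Books classifier; the 5x5 glyphs, the Hamming-distance comparison and the names `encode`, `decode` and `max_diff` are all assumptions made up for illustration.

```python
# Toy sketch: symbol-dictionary compression with a similarity threshold.
# Glyphs are 5x5 bitmaps flattened into 25-character strings.

O = ".###." "#...#" "#...#" "#...#" ".###."
E = ".###." "#...#" "#####" "#...." ".###."


def hamming(a: str, b: str) -> int:
    """Number of pixel positions where two equally sized glyphs differ."""
    return sum(x != y for x, y in zip(a, b))


def encode(glyphs, max_diff):
    """Store each glyph once; reuse any stored symbol within max_diff pixels."""
    dictionary, refs = [], []
    for glyph in glyphs:
        for index, symbol in enumerate(dictionary):
            if hamming(glyph, symbol) <= max_diff:
                refs.append(index)  # "close enough": reuse the stored symbol
                break
        else:
            dictionary.append(glyph)  # no match: add a new symbol
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decode(dictionary, refs):
    """Rebuild the page by pasting the referenced symbols back in order."""
    return [dictionary[i] for i in refs]


page = [O, E, O]

# Exact matching only: the page survives the round trip.
lossless = decode(*encode(page, max_diff=0))

# Loose threshold: "e" differs from "o" in only 4 of 25 pixels,
# so it is replaced by the stored "o".
lossy = decode(*encode(page, max_diff=4))
```

With `max_diff=0` the round trip is exact. With `max_diff=4` the "e" silently comes back as an "o", which is the "Contoots" effect in miniature.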
Wouldn't it be a good idea to perform OCR - using a language model, the works - before you start classifying the JBIG2 symbols? That way, you'd have additional contextual information to say "Aha, 'contoots' is probably not what it reads here" at least in some of the cases.
Although, I realize that on "Google scale", such a complex solution could be a problem.
Yeah right, you get it, don't you? They are similar, not equal. Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0.
I wonder which prize idiot had the idea of using this algorithm in a copier. JBIG2 can only be used where mistakes won't mean the world is going to end. A photocopier is expected to copy. If the machines were used for digital document archiving, some companies will face a lot of trouble when the next tax audit is due.
Digital archives using this kind of lossy compression are not only worthless, they are dangerous. As the paper trail is usually shredded after successful redundant storage of the images, there will be no way of determining correctness of archived data.
This will make lawsuits a lot of fun in the future.
Given the way the algorithm works, it would seem to me that "fine print" would be the most vulnerable to the bug (well not really a bug, it's the behavior of JBIG2). I wonder if there will be a clear dividing line, e.g. "smaller than 10pt type is subject to reasonable doubt if a Xerox copier was used"
Also it's not like there is a reference implementation for encoding JBIG2 everyone uses. We're talking about proprietary libraries which do the compression. These libraries are compared using performance indicators like speed, memory usage, etc. This gives sloppy crap implementations an advantage, because (and I'd bet on that) when the implementation was chosen, the deciders didn't even have the idea that a compression could actually manipulate the document content. Automated testing of compression algorithms is hard, because by design there can never be 100% proof, as the output image is different from the input image. If the comparison is broken, the test will fail to identify errors.
The critical failure in the design was thinking that some sort of algorithm performs equally or better than the human brain at recognizing text in low quality. This is - up to now - not the case.
Text won't be a big issue as mistakes are kinda easy to spot. Also it's less probable to have image fragments that seem similar but really aren't. The algorithm isn't really smart, it's mostly just pattern matching due to performance constraints. Thanks to kerning (variation in distance between individual characters), I doubt that swapping of words or sentences will occur a lot, unless the threshold for reaching significance in the comparing algorithm is higher than the guy was while designing it.
The real trouble starts when looking at numbers. Numbers are usually typeset monospaced, right aligned and/or in tables. The possible variations are pretty low, each digit represents 10 different possible meanings. Text documents are usually scanned at a pretty low resolution, because for a human it's still possible to distinguish between characters and numbers, even when a lot of information is lost. As already mentioned, algorithms cannot do this.
The next problem is: We can spot mistakes in text because there are syntactic and semantic rules which we more or less understand. While reading, our subconscious validates the input, so obvious errors will pop out. When it comes to numbers, there is no such thing. A number cannot be validated without additional knowledge. And as document processing is one of the labour intensive tasks, mostly executed by minimum wage clerks, there is no way in hell a mistake would be spotted before the documents are archived for all eternity.
Let's put on the tinfoil hat for a moment:
If someone wanted to really fuck up a company, they could just flash the printer/scanner/copier firmware, changing parameters of the compression.
I don't think it will be so trivial to use this defense. As somebody claimed, JBIG2 _reuses_ sufficiently similar blocks, so I guess it can be relatively easily determined whether the document has been messed up by lossy compression.
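One way such a check might look, as a rough sketch only: in a plain scan, sensor noise usually makes two instances of the same letter differ by at least a few pixels, so a large share of bit-identical glyph patches hints that symbols were reused. The function name `duplicate_fraction` is hypothetical, and segmenting the page into glyph patches is assumed to have happened already.

```python
from collections import Counter


def duplicate_fraction(patches):
    """Fraction of glyph patches that are bit-identical to another patch.

    `patches` is a list of hashable bitmaps (e.g. bytes objects), one per
    glyph already cut out of the page.
    """
    if not patches:
        return 0.0
    counts = Counter(patches)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(patches)


# Three distinct patches: nothing was reused.
clean = duplicate_fraction([b"\x3c\x42", b"\x3c\x43", b"\x3d\x42"])

# Three of four patches are identical: suspicious.
suspect = duplicate_fraction([b"\x3c\x42", b"\x3c\x42", b"\x7e\x40", b"\x3c\x42"])
```

A high value would only be a hint, not proof, and it says nothing about which of the reused glyphs were substituted wrongly.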
"Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0."
If that alone is reason for why JBIG2 is in no way secure for document processing, archiving or whatsoever - then I've got some bad news for you. Because if that's the case you really shouldn't be using a computer for, well, anything.
It is like taking a picture of my wife with a digital camera and her face being replaced with that of some other person.
I can imagine someone turning the technique into a novel form of image compression, maybe for surveillance databases or something.
It may be on by default for the cameras targeted at a female audience (in a rapidly shrinking market, female bloggers for instance are a big target), otherwise it won't even be available in more specialized or "hardcore" markets, like DSLR or mirrorless (4/3rds, Nikon 1, EOS M etc) for instance. As an anecdote, I bought a shockproof/waterproof compact camera last year and there's nothing so fancy on it.
Someone will be convicted (perhaps even without the intervention of a court) based on unimpeachable but falsified digital records.
Which of course leads to the conviction and torture of an innocent....
and I'm sure it won't be long when live video feeds can be hacked in real time to show something contrary to what's actually happening
Anyone got a reasonable reason for doing this?
So it sounds like there's one code path and it's seriously broken. I looked at the first settings page, and while it's in German I can see it's 200 DPI. There's no excuse for default lossy compression when you're at 200 DPI and doing office sized paper. We didn't do that in 1991, we got CCITT Group 4 lossless compression of around 50KB per image plus or more generally minus for 8.5x11 inch paper, although we did do things like noise reduction and straightening documents (that makes them compress better, among other things).
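For scale, a quick back-of-the-envelope calculation of what those figures imply. It assumes 1 bit per pixel and reads "50KB" as 50 * 1024 bytes; both are my assumptions, the comment doesn't spell them out.

```python
DPI = 200
width_px = round(8.5 * DPI)   # 1700 pixels across
height_px = 11 * DPI          # 2200 pixels down

raw_bytes = width_px * height_px // 8   # bilevel: 8 pixels per byte
compressed_bytes = 50 * 1024            # the "around 50KB" quoted above

ratio = raw_bytes / compressed_bytes
print(f"raw: {raw_bytes} bytes, ratio about {ratio:.1f}:1")
```

So an uncompressed page is a bit under half a megabyte, and lossless Group 4 at that size works out to roughly 9:1.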
1. It's monochrome. No greyscale, no color. This works for text and lines, but nothing else. No big surprise, it was designed for Fax. But this makes CCITT G3 and G4 lossy.
2. It has no defined endianness. This adds another fault risk which you won't see coming as long as you're working on an isolated platform but can hit you in the nuts when you change hardware or software.
3. The data does not contain resolution or dimensional information, nor any information about endianness.
This means that you have to rely on a container providing this information. It could be TIFF, it could be PDF, it could be something an intern coded during coffee break. This is good on one hand, but evil on the other. Software is sold, saying CCITT G4 compression (a standard, after all) is used, while the data can be embedded in proprietary containers.
4. It's a 2D compression, meaning the compression is applied on a matrix of binary pixel data. As the standard does not specify the dimensions, you depend on another image container like TIFF to provide information. Because G4 removed EOL markers, there is no way to reconstruct image dimensions from the compressed data alone.
5. It's not exactly fault tolerant. Transmission errors can influence larger areas of the image up to making the picture totally unreadable. Flipped bits are not too critical, missing bits are, due to the 2D compression.
There are many excellent, fault tolerant, standardized Image formats ready to use for document processing and archiving, CCITT G4 isn't exactly one of them.
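Points 2 and 3 above can be illustrated with a small sketch. It works on raw packed bits rather than real G4-coded data, and `unpack_bits` is a hypothetical helper, but it shows why the same bytes are ambiguous without a container stating the width and bit order.

```python
def unpack_bits(data: bytes, width: int, msb_first: bool = True):
    """Turn packed bytes into rows of pixels (1 or 0), `width` pixels per row."""
    bits = []
    for byte in data:
        # Walk the bit positions of each byte from the chosen end.
        positions = range(7, -1, -1) if msb_first else range(8)
        bits.extend((byte >> p) & 1 for p in positions)
    return [bits[i:i + width] for i in range(0, len(bits), width)]


data = bytes([0xF0, 0x0F])

as_msb = unpack_bits(data, width=8, msb_first=True)
# [[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]]

as_lsb = unpack_bits(data, width=8, msb_first=False)
# [[0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]]

as_narrow = unpack_bits(data, width=4, msb_first=True)
# [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
```

Same two bytes, three different images, depending only on metadata that has to come from somewhere else.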
What I meant to say was that CCITT Group IV gave acceptable sizes for early-'90s computing power, CPU and disk, and something at or better than its level of lossless compression today should be even more acceptable.
And in light of this screwup, I suspect we'd agree that Xerox would have been better off to use lossless (well, after the scanning, as you point out, but then again no one was willing to pay for color) CCITT Group IV than overly clever lossy JBIG2.
"It could be TIFF, it could be PDF, it could be something an intern coded during coffee break."
It could be something a journeyman software engineer edging toward expert coded on a Saturday afternoon in a very fast-paced project; for me, 3 weeks on the "engine". And, oh my, I can't remember encoding endianness, except of course for the leading TIFF bytes. But I had a guy who knew this cold telling me what to do, he was the one who debugged all our raw compressed data problems bit by bit. And, yeah, it was an "Intel" little endian TIFF, and I think I recall the Kodak Powescans produced that (600 pound monsters that could scan 18 inches per second at 200 DPI).
Hmmm, at least back then, "TIFF" was the selling point, and, oh yeah, it's Group IV compressed (except of course when it wasn't, we once dealt with some weird enhanced Group III).
I don't know how deep you have dived into TIFF, but maybe you remember the TIFF 6 standard way of embedding JPEG. It was the biggest pain in the ass imaginable: you had to parse the JPEG files, split them, and package the pieces into different TIFF tags. Before TTN2 and easy embedding of JPEG images, everyone invented their own way of avoiding the standard. Some defined their own compression type, some used the standard compression type but in a nonstandard way, ah, I'm starting to lose my hair again ;-)
And yeah, it was a mess; we mostly did the best we could and made sure the ones we generated worked for our customer's reader(s). Although I don't remember any big problems with people reading the ones we produced.
On a modern copier the scanner transfers the data first to RAM and then usually to a hard disk (most people do not even know that the "copy machine" has one and saves the scanned stuff to it).
From that hard disk the data is transmitted via laser to the drum.
Tadaaa - you have the reason for having data be compressed on a modern copier.
However, it does not have to be compression, per se. Modern copiers want to correct all kinds of errors such as creases and staples. They also want to optimize the colors. To do that, they have logic for detecting what areas of the page are full-color and which are black and white, which are half-tone printed, which are text, line art, photograph, whether the paper might have aged, etc.
I don't know what tricks they use, but I do not rule out that they will replace 'looks somewhat dirty' patches with an 'obviously higher quality version' of them, and use too aggressive parameters in some of those heuristics.
The whole thing is dangerous and wholly illogical.
This is akin to a crappy crime flick where someone hits the "enhance!" button on a CCTV still a few times and gets to see the dirt on the guy's teeth.
In this case, the computer decides the guy is female and has no teeth.
The size of the files or the number of them is totally irrelevant.
The size and number of files are and should be totally relevant even to "normal" people. When someone asks for something in e-mail, it's perfectly reasonable to say "no, it's much too big" and expect them to understand.
When somebody says, "can you email it to me" they mean, grant access to the data via their centralized messaging system, their email. There are many ways to make that happen, one of which is an attachment, another of which is linking to the content, but the key is to make sure that it's low friction and takes very little time or clicks to get access from the email.
It's a failure of technology when it's difficult to send ordinary-sized files like a few photos or a couple pages of documents. But it's a failure of people when they don't recognize the possibility that some types of data (video, large numbers of images, scientific research data, whole databases) simply can't be sent quickly, yet they fail to plan ahead to gain access. (I've also entirely skirted the issue of "some data should have its access restricted physically"...)
I disagree vehemently with that attitude, and I have to deal with it every day. In my field, >50% of the data we receive is transferred by overnight courier of hard drives due to quantity of data. It's a crappy attitude to blame people for having to learn that, and in an ideal world we'd share it via access granted by email. People should not be blamed for not understanding that, our infrastructure should be blamed for not supporting 10Gb everywhere, and cheap access to 40Gb+ on long-distance connections.
Nothing is helped by blaming people, and relationships can be harmed by doing that. But we can change the technology.
As a side note, DVDs? Really? They're incredibly slow at data transfer once you have them in hand, the tiny size of a DVD requires tricky archive spanning methods, and optical discs are flaky technology all around. Hard drives or LTO-5/6 all the way.
Incidentally, it's not just about transfer speed. Sometimes people ask if you can e-mail something that has never been put on a computer, and would take weeks or months to scan in. Or sometimes they ask for access to information when access is very slow to set up due to security or privacy considerations. Or sometimes they ask for access to something that the boss needs to physically sign off on, after the boss has gone home for the day. This is only a problem if they've decided it's urgent to have it, and simply haven't thought ahead about how it might not necessarily be possible to get instant access to every piece of information that ever existed.
We can change technology. But we also need to retain the mindset of arranging access beforehand. It's not about "blaming", it's simply about encouraging people to understand what they're asking for and to make sure they get the access they need before they need it.
[As an aside, DVDs are just an example of "sometimes it's really freaking slow to download data" that somebody like my mom would get. An alternative way to phrase it would be "downloading that would be so slow, it'd be better to just have your friend bring her laptop over." I certainly don't intend to suggest a new industry standard.]
I've seen systems in this day and age that fail in the face of e-mails as small as 5 megabytes (e.g. Yahoo Popgate) which IMHO is far too low - but evidently some sysadmins disagree with me!
Sending a link to something (even wrapped in a nice ui and container) has pretty different semantics from actually sending the something, though.
PS: Granted that's assuming fast networks. For 80+ Gig VMs sending a removable drive is often faster.
And encryption and file hosting cause more hassle in enterprise environments with unforgiving compliance policies.
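A rough sketch of that trade-off, for the curious. The link speeds and the 80% efficiency factor are assumptions picked for illustration, not measurements, and `transfer_hours` is a made-up helper.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours to move size_gb (decimal gigabytes) over a link_mbps link.

    `efficiency` is an assumed fraction of the nominal rate actually achieved.
    """
    bits = size_gb * 8e9
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


for mbps in (10, 100, 1000, 10000):
    print(f"80 GB at {mbps} Mbit/s: {transfer_hours(80, mbps):.2f} h")
```

Under these assumptions an 80 GB image takes about 22 hours at 10 Mbit/s and a bit over 2 hours at 100 Mbit/s, so a drive handed over in person can easily win on slower links.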
There is virtually no reason whatsoever for this problem to exist. This is the domain of "making a problem more risky and complicated than it needs to be" and royally screwing people in the process.
Might as well throw the paperwork in a bin and set fire to it.
The real failure is probably something more cultural: there was nobody with the discipline, experience, and power to write an engineering policy prohibiting the use of lossy compression in duplication equipment. I have no idea about Xerox's corporate history, but the evisceration of engineering departments in US giants and the concomitant decline in what one might call 'standards' or 'rigor' is an established concept.
I have never heard of JBIG2. I implemented JPEG2000 codecs from scratch, arithmetic coding compression, and I have never heard of JBIG2. And here they are using it, and others are claiming it is just a standard, run-of-the-mill thing.
> That would seem to be a safer choice than writing a lossy compression algorithm from scratch.
Going out on a limb here, wouldn't the safest be to just not use a lossy codec at all or use something like JPEG?
> QA testing probably was on the order of 'picture looks right';
Sorry. This is the company whose name is the equivalent to the verb "to copy". If plugging in an obscure codec from some place and checking if one picture looks "OK" is their idea of QA then they deserve all the ridicule and lawsuits stemming from this.
It is useless to someone that wants to compress arbitrary images, since it is bi-level only, I'd ignore it too if I wanted to compress a photograph. Not having an open specification hurts. The "last draft" is available, but the final was sacrificed to someone's business model.
You need to put your corporate drone hat on. How many people are involved in making a Xerox copier? How many parts are reused from the previous model? How much software is reused?
My best guess is that a large number of components in a copier are engineered in isolation. The image compression people responsible for implementing JBIG2 probably don't even care about correctness beyond some threshold ("not my problem"). The people responsible for ensuring correct copying may not even know that an image compression exists, and even if they do, may not understand the technical nuances of JBIG2, and also may not even have the right documents to find an instance of such a problem.
The problem isn't using a standard compression algorithm. It's failing to consider the properties of the algorithm used in relation to the problem domain.
A classic mistake engineering students make is to try and use familiar equations anywhere that the units work out. As a result, engineering professors hammer in the idea that before using any equation, you have to ask yourself: what are the assumptions underlying this equation, and do those assumptions hold for my specific problem? Similarly, if you're writing software for copiers, you should ask the basic question of whether a particular compression algorithm was appropriate for the particular types of images being compressed. It's incredibly basic.
I can totally see why this error happened. It was the equivalent of the engineering student blithely applying any equation where the units work out. Uncompressed pixels go in, compressed data comes out. Compression algorithms are substitutable... except when they're not.
Grrr, I'm now going to have to view every lab report that's not an original with suspicion, and make sure my doctors aren't making recommendations due to screwed up copies.
Lossy compression is not an acceptable default for a general purpose device.
Edit: The last section now sketches what the reasons for the issue may be, based on several emails I got.
Not that I have any valid reasons to consider this.
Once you have that, you can turn it into a sales tool for folks selling multi-function printers such that there are "good" printers and "bad" printers, and then everyone will be forced to pass the test or be labeled a "bad" printer.
"Digital Photocopiers Loaded With Secrets"
I shudder to think how much we've scanned that could be affected by this. Thankfully, I think all of our engineering drawings (which for a decade+ were printed, signed, then scanned when needed for digital issue) were done on a non-Xerox device, but all of our standard A3/A4 business stuff is done on Xerox devices.
I suspect Xerox is using this option and their implementation is getting confused (perhaps by the low resolution). Unless I'm greatly mistaken, the patch size for normal compression shouldn't figure here.
It's particularly absurd in this case since it's clearly not easy to learn that lossy compression is being applied or how one would disable it if they wanted their Xerox to work like every other copier/fax they've used.
I spent half a decade on document imaging in the early to mid '90s, a fair amount close to this level (had a coworker who loved bit level stuff for the truly evil problems like this), and I can see how it happened ... given sufficiently careless developers.