That's right but it's a pity that PDFs by default don't even include their unico...

sketerpot · on Feb 15, 2009

Oh man, don't get me started! I wrote some PDF processing software last year -- it was supposed to extract pinout diagrams from integrated circuit datasheets -- and to extract the text was a ridiculous task. Since characters can be specified in any order, what I had to do was look at their positions on the rendered page and then clump those into words with some overly complicated graph algorithms. Working with PDF should not be this hard.
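To make the "clump positions into words" step concrete, here is a minimal sketch. It is not the graph-based approach described above, just a simpler greedy grouping. The `Glyph` type, the tolerance values, and the assumption that `y` grows downward on the rendered page are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: float      # left edge on the rendered page
    y: float      # baseline; assumed to grow downward
    width: float


def group_into_words(glyphs, line_tol=2.0, gap_tol=1.5):
    """Rebuild lines of words from glyphs given in arbitrary paint order."""
    # Step 1: bucket glyphs into lines by baseline proximity.
    lines = []  # each entry: [baseline_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(g.y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    # Step 2: within each line, sort left to right and split on wide gaps.
    result = []
    for _, line in lines:
        line.sort(key=lambda g: g.x)
        words, current = [], [line[0]]
        for prev, g in zip(line, line[1:]):
            if g.x - (prev.x + prev.width) > gap_tol:
                words.append("".join(c.char for c in current))
                current = []
            current.append(g)
        words.append("".join(c.char for c in current))
        result.append(words)
    return result


# Glyphs deliberately listed out of reading order, as a PDF may paint them.
scrambled = [
    Glyph("o", 20, 10, 5), Glyph("k", 5, 30, 5), Glyph("H", 0, 10, 5),
    Glyph("y", 15, 10, 5), Glyph("o", 0, 30, 5), Glyph("i", 5, 10, 5),
]
print(group_into_words(scrambled))
```

Even this toy version depends on two hand-picked tolerances, which is the kind of guesswork that breaks on multi-column layouts, rotated text, or tight kerning.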

If I were making a PDF successor, this would be one of the top problems to fix. At least make text selection work properly! And simplify the format a little, would you please? To get any sort of compatibility with my PDF processing program I had to write it as a custom rendering backend for the Poppler PDF engine. It should not be this hard.

stass · on Feb 14, 2009

What do you mean by 'unicode text contents'? Of course, text can't be pasted into a PDF, but it's not a format intended for "editing"; we have plain text for that. PDF was created to distribute text documents, drawings, and so on in a way that looks exactly the same everywhere. But unlike PostScript, it includes some high-level features like word indexes, protection, and so on, so you can search inside PDFs, add notes, and place interactive elements, things that are not possible at all in other formats.

Thus I don't really understand where the problem with the format itself is. If you have suggestions regarding needed features, take part in the ISO committee.

fauigerzigerk · on Feb 14, 2009

I didn't mean pasting into the PDF but copying parts of a PDF in order to paste them elsewhere. This is not deterministically possible because (most) PDF documents do not contain a sequence of letters or words but rather a sequence of painting instructions, which can be different from the order in which the document is read.

Also, the codes used to represent letters refer to fonts, not to Unicode code points (or other character sets, for that matter). So my problem is that extracting text from PDF has to use a heuristic approach that always fails at some point. That's why copy and paste out of PDF documents sometimes leads to such strange results.
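A toy illustration of the font-code problem, assuming a made-up data model rather than a real PDF parser: each text run is a font name plus a list of font-specific codes, and a font may or may not carry a code-to-Unicode table (PDF's optional ToUnicode map plays roughly this role). The names `fonts`, `runs`, `decode_run`, and `extract_text` are hypothetical.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character for undecodable codes


def decode_run(codes, to_unicode):
    """Turn font-specific codes into text, if the font says how."""
    if to_unicode is None:
        # Without a mapping the codes only select glyph shapes;
        # a real extractor would have to guess here.
        return REPLACEMENT * len(codes)
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


def extract_text(runs, fonts):
    return [decode_run(codes, fonts[font]) for font, codes in runs]


# Hypothetical fonts: F1 ships a code-to-Unicode table, F2 does not.
fonts = {
    "F1": {1: "H", 2: "e", 3: "l", 4: "o"},
    "F2": None,
}

# The same codes drawn with each font look identical on screen,
# but only the F1 run can be extracted deterministically.
runs = [("F1", [1, 2, 3, 3, 4]), ("F2", [1, 2, 3, 3, 4])]
print(extract_text(runs, fonts))
```

Both runs render the same word, yet only the first one yields text without heuristics.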

The use case I'm talking about is to distribute documents for viewing, printing AND further processing.

The issues I have with the PDF format are not solvable by adding features because the features are already there. PDF documents can contain Unicode text and a great deal of other structural information, as you have pointed out. I know because I have written PDF software and I have read the spec very carefully, top to bottom.

There are two major problems:

* The PDF format is incredibly bloated, difficult to process, and it allows documents to be distributed without deterministically extractable text. (And I don't mean the case where the author deliberately restricts text extraction.)

* The widely used tools to create PDF files, by default, do not use the PDF features that would allow deterministic text extraction.

The PDF format is about as old as the web. Data integration and search were not important tasks when PDF was created in 1992. It was meant for printing and viewing only.

The only way to solve these problems would be to remove features from the PDF spec or to mandate the use of other features. Both would be incompatible with previous versions in a way that is completely unacceptable. That's why I think there has to be a new, much simpler format that leaves all the old arcane baggage behind and facilitates reliable processing in addition to viewing and printing.

twopoint718 · on Feb 14, 2009

I have a question that maybe you can answer, having really perused the PDF spec. Is it true that PDF removed the Turing-completeness of PostScript?

I had heard that the main reason for creating PDF was that you couldn't render a single PS page without rendering all the ones preceding it.

fauigerzigerk · on Feb 14, 2009

Yes, that's true. There are no loops or gotos, and functions have no side effects. PostScript has all of that. But I think it would have been sufficient to remove the possibility of global side effects to support out-of-order rendering of pages. A purely functional language without global state could've done the trick.
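A toy model of that point, not how either format is actually specified: in the first half, pages are procedures that share one mutable state, so page n is only correct after pages before it have run. In the second half, each page carries its own resources and is a pure function of them. All names here are invented for illustration.

```python
# Stateful model: pages share and mutate one interpreter state.
def ps_page1(state):
    state["font"] = "Helvetica"   # global side effect that later pages rely on
    return f"page 1 in {state['font']}"


def ps_page2(state):
    return f"page 2 in {state['font']}"


def render_stateful(pages, n):
    """Rendering page n means replaying every page before it."""
    state = {"font": "Default"}
    output = None
    for page in pages[: n + 1]:
        output = page(state)
    return output


# Pure model: each page is (resources, draw function) with no shared state.
def draw(resources):
    return f"page {resources['number']} in {resources['font']}"


pure_pages = [
    ({"number": 1, "font": "Helvetica"}, draw),
    ({"number": 2, "font": "Helvetica"}, draw),
]


def render_pure(pages, n):
    """Any page can be rendered on its own, in any order."""
    resources, draw_fn = pages[n]
    return draw_fn(resources)


print(render_stateful([ps_page1, ps_page2], 1))
print(ps_page2({"font": "Default"}))   # run in isolation: wrong font
print(render_pure(pure_pages, 1))
```

Note that the pure version needs no restrictions on loops or recursion inside `draw`; only the shared state had to go.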

So I suspect there were other reasons as well for removing so many features. Maybe the complexity of interpreters. I don't know.