So, reading the article is a bit weird. It's clear there's an anti-PDF bias from the start, with the implicit assumption that everybody hates reading PDF files. Actually, I don't, because I get to read a well-formatted document. They even say that it should only be used as a format for things to be printed, never as a document for people to read on a computer... and yet this article is clearly meant to be read on a screen, not printed out. It also contains a hypertext link to their company that obviously wouldn't work if printed, and they embed it in an iframe, because they expect people to be reading it online.
But towards the end, you start to see the real objection to PDFs - that it's not always easy to extract text automatically from a document. It mentions a few of the issues - extra spaces, not enough spaces, hidden text that gets extracted because it's off-page, fonts that are designed to obfuscate the internal text, e.g. re-arranging characters or splitting glyphs up in strange ways, etc. It's worth noting that, with the exception of the spacing issues, these techniques are deliberately used to stop people extracting the text, or the copyrighted fonts, from the document.
It's not at all obvious from the document itself, but if you click on the link to the company, all becomes clear. The reason this company is saying that all these things are problems with PDF is that they are in the business of extracting raw text from PDFs. All the designers' efforts to place things in specific places so the result is pleasing for a human to read... they don't want any of that. They just want to extract the raw text so they can data-mine it and sell that as a service.
You are not reading a PDF document, you are reading a visual representation constructed by a program which is made by people who tear their hair out.
The PDF “specification” is not a specification; it only documents the happy path. It never states that Acrobat's behavior is the holy truth, but in practice undocumented bug-for-bug compatibility is assumed. (We're talking about the most basic, universally supported features here.) If ISO were worth their salt, they would at least try to codify the de facto behavior instead of stamping their name on an Adobe-provided document; then it would be a horrible but fixed format. A collection of tests would be nice to have, too.
Of course, this “history” is just a promotional leaflet, which describes the “layman approach” they tried to construct. It's a fault not to mention that PDF was, and still is, a foundation of the digital print industry, where big vendors solve compatibility problems for mere mortals, and thereby create unwritten rules about what should and shouldn't work.
It is also ironic that they praise the Web, but have to use the Web Archive to link to the article from the ancient year of… 2020.
Mostly, the page description language (inside PDFs) says "move to point (x, y) and write text foo" or "move to point (x, y) and draw glyph bar".
The reason that many PDFs produce garbage when extracting text is that the underlying document doesn't include fonts; every letter is a drawn glyph. This is most common in older (1990s) PDFs generated on UNIX systems.
Since the page description language just says "write text foo", the text is broken up however the generating software chose to break it, so there is not necessarily a whole line of text as a human would see it.
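For the curious, you can watch this happen. A minimal sketch, assuming pypdf and a hypothetical example.pdf: each Td/TD/Tm operator repositions the text cursor and each Tj/TJ shows one fragment, which is why an extractor sees fragments rather than the lines a human sees.

    # Dump the text-positioning and text-showing operators from page 1.
    # Assumes pypdf >= 3.x; "example.pdf" is a hypothetical file name.
    from pypdf import PdfReader
    from pypdf.generic import ContentStream

    reader = PdfReader("example.pdf")
    page = reader.pages[0]
    content = ContentStream(page.get_contents(), reader)

    for operands, operator in content.operations:
        if operator in (b"Td", b"TD", b"Tm"):   # move the text cursor
            print("move", operands)
        elif operator in (b"Tj", b"TJ"):        # show a string fragment
            print("show", operands)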
And some PDFs are impossible to extract text from because the pages have been flattened into images. Law firms are notorious for doing this - providing the documents exactly as required/specified during discovery, while making it impossible for the text to be extracted. Basically it is a fax - every page is a TIFF image (because it is harder to OCR than a JPEG, although JBIG2 has its own flaws [0]).
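When every page is an image, extraction proper is off the table and OCR is the only recourse. A hedged sketch, assuming pdf2image (which wraps poppler) and pytesseract (which wraps the tesseract binary); the file name is hypothetical:

    # Rasterize each page, then OCR it; there is no text layer to extract.
    # "scanned_filing.pdf" is a hypothetical file.
    from pdf2image import convert_from_path
    import pytesseract

    for i, img in enumerate(convert_from_path("scanned_filing.pdf", dpi=300)):
        print(f"--- page {i + 1} ---")
        print(pytesseract.image_to_string(img))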
I've been working with PDF projects off and on since the late 90s. The standard tries to be everything for everybody and that makes it a Charlie Foxtrot from top to bottom (if you've ever written an object viewer/editor to dig inside PDFs, you know exactly what I'm referring to). It is a great spec for making a document that appears the same no matter where it is viewed or printed. But I always treat it as a sausage: you can turn the cow into a sausage, but you can't turn that sausage back into a cow.
I wonder if this explains why trying to copy and paste text out of a PG&E bill would always come back as gobbledygook when I used to receive such bills in the past.
> fonts that are designed to obfuscate the internal text, e.g. re-arranging characters or splitting glyphs up in strange ways, etc. It's worth noting that, with the exception of the spacing issues, these techniques are deliberately used to stop people extracting the text, or the copyrighted fonts, from the document.
Maybe (and, for the fonts, likely), but I don’t think it’s the only reason. Subsetting embedded fonts makes PDFs smaller, often a lot smaller (why embed an entire font when the document uses only a single glyph of it as a bullet point? Why include Chinese, Japanese, etc. glyphs if the document doesn’t use them?)
Even if it’s possible to do that without changing the code-point-to-glyph mapping (is it? I don’t know enough about fonts to answer that), implementing it may be simpler, or result in smaller files, if one makes the embedded font dense in code points (I tried finding an answer, but soon remembered how complex fonts are, and gave up)
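For what it's worth, subsetting doesn't have to scramble the mapping. A minimal sketch with fontTools (font file names are hypothetical); by default it keeps the original cmap entries for the glyphs it retains, so code points survive, though a PDF producer is free to remap them anyway:

    # Keep only the glyphs needed for a bullet point; the surviving
    # glyphs keep their original Unicode code points in the cmap.
    from fontTools.ttLib import TTFont
    from fontTools.subset import Subsetter, Options

    font = TTFont("DejaVuSans.ttf")        # hypothetical input font
    subsetter = Subsetter(Options())
    subsetter.populate(text="\u2022")      # just the bullet glyph
    subsetter.subset(font)
    font.save("DejaVuSans.subset.ttf")     # often a tiny fraction of the size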
And of course, modern tools _should_ output accessible PDF documents, which means text extraction _should_ work. I wouldn’t know how well that works in reality, but have my doubts.
Actually, most PDFs are formatted in a good way and it’s easy to extract text. The stupid stuff is copy protection, which is a pointless feature (because PDF viewers can simply ignore it)
I have no idea why somebody would hate PDFs for data extraction when stuff like .doc/.xls (the old formats) is clearly way worse. PDF sometimes has its quirks, but the 2.0 version clearly cleans up a lot of the mess
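On the copy-protection point: the restriction bits really are advisory. A hedged sketch with pikepdf (file names hypothetical); when only an owner password is set, the file decrypts with no password at all, and the permission flags are just flags a tool may choose to ignore:

    # Open a file whose owner password forbids copying; no password needed.
    # "locked.pdf" is a hypothetical file with an empty user password.
    import pikepdf

    with pikepdf.open("locked.pdf") as pdf:
        print(pdf.allow.extract)        # the advisory "may copy text" bit
        pdf.save("unlocked.pdf")        # the saved copy drops the encryption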
It’s usually easy to extract individual strings from a PDF, normally single lines, but it can be quite hard to understand how those combine into longer paragraphs, especially if the page has multiple columns and inline figures.
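A hedged sketch of how much of that reconstruction is heuristic, using pdfminer.six (the file name is hypothetical): LAParams are literally knobs for guessing, e.g. how close characters must be to merge into a line, and boxes_flow for trading horizontal against vertical order, which is exactly what multi-column pages stress.

    # Group extracted strings into paragraphs using layout heuristics.
    # The values shown are pdfminer.six's defaults; tune them per document.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    text = extract_text(
        "two_column_report.pdf",   # hypothetical file
        laparams=LAParams(char_margin=2.0, line_margin=0.5, boxes_flow=0.5),
    )
    print(text)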
It’s also easy to create a PDF that is hard to extract text from, not through a deliberate attempt to enforce copy protection but often simply through attempts to reduce the size of the file, as you may not want to store the entirety of a font in a document.
I’ve been on both ends of this, generating documents and consuming them, and I think we could probably have created something that allowed for much easier text extraction, but it’s far too late now.
> I have no idea why somebody would hate PDFs for data extraction when stuff like .doc/.xls (the old formats) is clearly way worse.
Is this sarcasm?
AFAIK, PDF is deliberately designed not to give a F about semantics. There is no way to determine what is part of what in a PDF document. All you've got is association by adjacency.
Hasn't it always been that way? Has something changed?
Most often the text runs you would extract from a PDF are in the order you would read the text, probably because the word processor or application that created the PDF dumped the text into the capturing PDF context from its own text container - which, in fact, holds the text in the order the word processor would display it (editing, searching, and text selection in the creating app obviously benefit if the text container is in reading order).
When text is not in reading order within a PDF page it is often headers, footers, captions, callouts, block quotes....
There are, I believe, features in the modern PDF spec to allow for accessibility that would give you more structure in that raw text. I am not sure this is a widely used feature when creating PDFs, though.
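That feature is Tagged PDF: a logical structure tree hung off the document catalog. A minimal sketch of checking for it, assuming pikepdf; the file name is hypothetical:

    # A tagged PDF carries /StructTreeRoot in its catalog; without it,
    # extractors can only fall back to geometric guessing.
    import pikepdf

    with pikepdf.open("example.pdf") as pdf:
        if "/StructTreeRoot" in pdf.Root:
            print("Tagged PDF: logical structure available")
        else:
            print("Untagged: only raw text runs and positions")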
It's true that data is often written out in a logical order, but like you say, that's only because the program that created it was designed that way. I've definitely seen PDF files where tabular data is almost in a logical order but every now and then cells have been jumbled around.
But what I've also definitely seen is documents where the characters are deliberately jumbled up, with a custom font, so that visually everything looks fine. I know this because there was one specific case where I wanted to extract about 5000 words in a vocabulary list and it was hard to decipher. They'd used several such fonts in the single document as well, so there wasn't a one-to-one mapping for the text obfuscation. They'd also put a watermark under the list, so you couldn't easily OCR the final screen image either.
To be sure, the content-creator can run riot with the PDF spec and make it suck for everyone but a human reading the screen or printed page. Fortunately I would say 99% of PDFs are much better behaved than that.
Except it's a poorly formatted document, because it's not formatted to fit screens of different widths, which is huge (phones are a thing)
Also, you haven't solved another huge failure of the most basic digital workflow - copy & paste - by pointing at the author's motivation, since the "extra spaces" problem ruins it for everyone, not just professional data extractors
After working on PDF document reconstruction for more than a decade, I often fantasized about inventing a cleaner and simpler alternative. After all, there are only three kinds of objects in PDF: shapes, images and glyphs. But it is all those little details that will get you in the end. A line - all you need is a coordinate and a length, right? No: is it solid? What is its width? Is its end point anchored on the leftmost part of the visible line, or does the thickness spread out from the anchor point? And is the end square or curved? If curved, what are the parameters of the curve? Are both ends the same? On and on it goes. And don't even get me started on glyphs...
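To make that concrete: in PDF the stroke is centered on the path, and width, caps, joins, and dash pattern are all separate graphics state. A minimal sketch with reportlab (output file name hypothetical), just to draw one line:

    # Every one of these calls answers one of the questions above.
    from reportlab.pdfgen import canvas

    c = canvas.Canvas("one_line.pdf")
    c.setLineWidth(4)          # thickness, spread evenly around the path
    c.setLineCap(1)            # 0 = butt, 1 = round, 2 = projecting square
    c.setLineJoin(1)           # 0 = miter, 1 = round, 2 = bevel
    c.setDash([6, 3])          # solid vs dashed, and the on/off pattern
    c.line(72, 700, 300, 700)  # finally, the coordinate and the length
    c.save()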
PDF is a remarkable creation. It has some notable weaknesses, such as the fact that its color channel for images does not include alpha, and thus needs masks, but the fact that it covers so much visual complexity in a relatively compact form is just amazing. (BTW: its graphics model comes straight from Adobe PostScript, but PDF content streams are not programs.)
One thing that bugged me while reading this article was the use of the definite article ("the PDF"). Since PDF is an acronym for "Portable Document Format" there may be a grammatical case to be made for the "the", but no one says "the HTML" or "the NASA" and so on.
>In 2020, Nielsen made the case again, writing, “After 20 years of watching users perform similar tasks on a variety of sites that use either PDFs or regular web pages, one thing remains certain: PDFs degrade the user experience.”
Good luck saving an HTML version of any modern web page and being able to read it in twenty or thirty years' time. HTML just wasn't designed for that.
An issue with this is that the print CSS of most websites is an afterthought.
While it’s possible to alter the design with @media print, as well as the page breaks, few websites do this. You are often left with broken layouts, empty pages, or nonsensical page breaks.
For some insane reason, at one of the stores we use where you can order online and pick up at the store, when you try to print the page with the barcode, the barcode does not print. We end up having to take a screenshot and print that. It's just utterly baffling, especially for this specific use case.
I tried to make a conference poster with SVG - using Inkscape - and it was a minor disaster that rendered differently in different programs/browsers, with some features entirely broken.
IBM tried to push a competitor in the 1990s… BookManager was an initially mainframe-only (VM/CMS, MVS, etc.) combination of viewer program and proprietary format. It came about in response to both IBM customers and product documentation groups demanding some sort of online “hypertext” version of the thousands of publications available.
IIRC it came out around the same time as the initial Acrobat format, but not necessarily in response to it. Eventually there were viewers for Windows and OS/2. It wasn't particularly bad, but it was very literal in display, and Acrobat/PDF rapidly left it in the dust.
When the web boomed in 1995–1996, the product group behind BookManager tried to ban distribution of PDFs by other IBM groups, but failed. One of the problems with BookManager-formatted files is that you had to recreate the appropriate record format if you transferred them back to a mainframe, and I vaguely recall EBCDIC vs ASCII issues (where PDF is, I think, UTF native?).
Microsoft tried with XPS, which is a zipped XML format, pretty much like MS Office 2007+ files. To Adobe's credit, they made PDF an open standard around the time XPS came out. Maybe it's a combination of being there first, many files already being in PDF, and finally making the format open that made PDF win.
They do say "please don't" as far as their "LC preference" goes [1], but then later in the document they have nice things to say about the format being just .zip and .xml, so its introspection and recovery options are much better than "welp, hope pdf2text still exists in 2040"
1: they have a Recommended Formats Statement: https://www.loc.gov/preservation/resources/rfs/ which is currently published in HTML and PDF with a "Get Adobe Reader" button on the page, which I feel is dangerously misguided advice
Well, this is exactly why PDF was invented, and it's doing its job so well: to preserve a desired layout and very specific information about how something has to be output.
That comes with downsides, yes, but at its core it's just working fine.
edit: A third option would be to render your content as an image, but that comes with its own downsides.
Except I don't feel many programs work in PDF natively, other than Adobe products. It's always just an export target, or you "print to PDF"
So to me it kinda looks like the format is lacking
I also don't know much about it, but I assume it's not easy to generate programmatically, while generating an SVG diagram/image, for instance, is generally pretty trivial
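A hedged comparison sketch (file names hypothetical): the SVG really is just a string you can type by hand, while for the PDF you reach for a library such as reportlab so it can write the objects, xref table, and trailer for you.

    from reportlab.pdfgen import canvas

    # SVG: plain text, trivially generated.
    svg = ('<svg xmlns="http://www.w3.org/2000/svg" width="200" height="50">'
           '<text x="10" y="30">Hello</text></svg>')
    with open("hello.svg", "w") as f:
        f.write(svg)

    # PDF: a binary container of objects; let a library do the bookkeeping.
    c = canvas.Canvas("hello.pdf")
    c.drawString(72, 720, "Hello")
    c.save()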
PDF is the worst document format, apart from all the other formats. When developing software to read or process PDFs, the PDF spec can always deliver a jump scare like no other spec. But to give it credit, it broke Microsoft's stranglehold on documents - not completely, but back in the mid-2000s organizations no longer required you to submit things as Word documents.
Adobe file formats never had the reputation of being easy to work with. I spent some time with the CFF font format, and I can say it was not a pleasure.
This is a nit for me in the PDF experience. The browsers I use tend to have no difficulty rendering a PDF in the browser, but every now and again you click on a PDF and now it's in your Downloads folder, opening in your OS's PDF viewer. 99% of the time, those PDFs are as disposable as HTML pages, and I'd rather not have to manage and store them on my machine.
For everybody complaining about the non-transformability of PDF: there are several PDF standards out in the wild.
In the graphic industry we mainly use PDF/X files. These are very solid and precise in defining the layout and how objects are rendered.
For archiving purposes there's another standard, it's called PDF/A. Part of PDF/A is that you must be able to transform its text content back to Unicode.
So, if you're looking into being able to convert PDFs back and forth, you should probably use PDF/A. PDF/X files will drop that support in order to reproduce the desired appearance as closely as possible.
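A hedged sketch of checking what a file claims to be, assuming pikepdf (the file name is hypothetical). Note that the XMP declaration only records the producer's claim; actual conformance needs a real validator such as veraPDF.

    # Read the PDF/A identification from the XMP metadata, if present.
    import pikepdf

    with pikepdf.open("archive.pdf") as pdf:
        meta = pdf.open_metadata()
        part = meta.get("pdfaid:part")            # e.g. "2"
        conf = meta.get("pdfaid:conformance")     # e.g. "B"
        print(f"declares PDF/A-{part}{conf}" if part else "no PDF/A claim")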
I would also add that PDFs are often not meant to be transformed. They are the digital equivalent of a book, which no one complains about not being able to edit. If you want a document you can edit, you don't use a PDF, but something else, before you export it as a PDF. The same is true of images: no one complains about .jpg not being editable, as any sane person would use a Photoshop or similar file and only export the final product.
PDF/A is a joke of a “standard” that does almost nothing of what is promised on the cover. It is just a subset of PDF with limits on variable options like color representation, frozen at some arbitrary point in time - probably because people working with digital archives realized that they couldn't chase the moving goal and implement the ever-growing list of features. We may only expect programs producing PDF/A files to be less “creative” and produce straightforward markup, but that's not guaranteed at all, because PDF/A doesn't address any of the real core format issues.
> Comments on places like HackerNews refer to it as “one of the worst file formats ever produced” [1], “soul-crushing” [2], and something that “should really be destroyed with fire” [3].
I remember seeing this software on the university's computers, Acrobat... and I was thinking: WHAT is that software that I see on EVERY PC? I didn't know what PDF was at the time. I grew up with an Amstrad PC1512 (oh, and a family too) but never had to use Acrobat. Only GW-BASIC, Zaxxon, Bubble Bobble, Defender of the Crown, and other super useful software ;)
The most 'complicated' software I used was Volkswriter!
I'm so impressed by the design of this PDF file. It's amazing that they put in so much effort to design what comes down to just an informational article.
I like PDF; it does one thing and it does it well. I just wish there were free alternatives for working with PDFs. I can't believe something so widely adopted is still so closely controlled by Adobe.
You can't find a free tool that offers features close to Adobe Acrobat; there is none. You have to download multiple tools that each offer their own feature close to Acrobat's.
I thought being emancipated from Word docs was freedom, until I realized the suffering brought on by a left-field version of wkhtmltopdf that no one uses but someone built and distributed via pamac/AUR. In software, Hybridization is Postmodernism.
PDF shares a common property with computational complexity and robotics: the magical world of software is no longer free from physics, as if it ever was. Software has reciprocity in creating these illusions and destroying them.
You can tell the writers and designers hate the PDF format, because they made the damn thing so difficult to read, layout-wise. They went 1990s PageMaker/QuarkXPress crazy here.