From a machine-parsing perspective, PDF files are a nightmare. Chunks of text may be broken anywhere: mid-sentence, mid-word. These chunks may appear in the document in any order.
Spaces may be encoded as actual space characters, or they may be produced in a number of other ways, such as by positioning chunks or by setting per-character spacing.
The mapping from code point to glyph does not need to be pure Unicode; a PDF document may contain a custom font with additional glyphs.
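To make the first two points concrete, here is a toy illustration (mine, not from any real parser). Both content streams below use genuine PDF text operators (BT/ET, Td, Tj, TJ) and paint roughly the same words, but only the first contains them as one string with a real space byte. The streams are shown already decompressed, and the regex "extractor" is deliberately naive:

    import re

    # One chunk, space encoded as an actual 0x20 byte.
    simple = rb"""
    BT /F1 12 Tf 72 700 Td (Hello world) Tj ET
    """

    # Same visible result: the word is split into chunks, the number in the TJ
    # array is a positioning tweak, and the gap before "world" comes from the
    # second Td move, not from any space character.
    fragmented = rb"""
    BT /F1 12 Tf 72 700 Td [(He) 15 (llo)] TJ ET
    BT /F1 12 Tf 110 700 Td (world) Tj ET
    """

    def naive_extract(stream: bytes) -> str:
        # Grab every literal string in stream order. A real extractor must also
        # track the text matrix, font encodings, character/word spacing, etc.
        return "".join(m.decode("latin-1") for m in re.findall(rb"\((.*?)\)", stream))

    print(naive_extract(simple))      # Hello world
    print(naive_extract(fragmented))  # Helloworld  <- chunks glued, space lost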
This is all stuff I learned by trying to parse a limited set of PDFs found in the wild.
All of these gotchas are, by the way, completely PDF/A compliant.
If I am given something that I am personally expected to read thoroughly - be it a report, long-form article, slide deck, etc - then the most professional format by far is a LaTeX PDF.
I can't claim anything beyond personal experience, but if I want to signal to someone that a document is important and was written with care, then they are getting a PDF.
It is perfectly possible to generate a PDF file with none of the issues mentioned; the problem is that most people don't have the required control over their toolchain, and a lot of tools will create such issues by default.
In the end it was one of the most interesting projects I worked on in my (short) career, but sometimes it sucked.
I've seen this in academia as well. And, inspired by PoC||GTFO, I've been thinking about downloading academic PDFs, writing a web server that provides an interactive model of the topic in the paper, patching it into the PDF to turn it into a polyglot, and then re-uploading the PDFs. This way people who want their PDFs can have them, those who want something a little more modern can have that simply by interpreting the exact same file as a bash script, and I get to understand the paper by modeling it.
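For what it's worth, the polyglot step can be surprisingly small. The sketch below is my own rough take, not a PoC||GTFO recipe: it prepends a shell preamble ending in exit 0 to an existing PDF, relying on viewers that tolerate bytes before the %PDF header and that rebuild the now-shifted cross-reference offsets. Strict parsers and validators will likely reject the result, and the server command is just a stand-in:

    # Hypothetical helper; assumes a lenient PDF viewer and that "./model" holds
    # the interactive web model mentioned above.
    PREAMBLE = b"""#!/bin/sh
    # Run as a script: serve the interactive model, then stop before the PDF body.
    python3 -m http.server 8000 --directory ./model
    exit 0
    """

    def make_polyglot(pdf_path: str, out_path: str) -> None:
        with open(pdf_path, "rb") as f:
            pdf = f.read()
        with open(out_path, "wb") as f:
            # The shell sees a script and stops at exit 0; a lenient PDF viewer
            # skips the preamble and still finds the PDF body.
            f.write(PREAMBLE + pdf)

    make_polyglot("paper.pdf", "paper_polyglot.pdf")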
In my line of work, I must lay out information and facts as if they were on paper, in order, to create a specific narrative.
This narrative cannot be lost in hyperlinks or other web-specific constructs. Cases must be laid out in a very specific order, from beginning to end, to make my argument as to why things were done the way they were.
No other medium fits this except paper, or PDF, in a digital sense.
Edit: On websites, content should be replicated in an appropriate format, but most certainly referenced to its original. And the original should be readily available.
It’s such a mess.
Would someone with experience care to explain why? Does it have to do with each letter having an absolute position in the document? I have no clue, to be honest.
Didier Stevens is the best PDF expert I know of: https://blog.didierstevens.com/programs/pdf-tools/
* identifying characters that won't actually print [white text; zero-size font; beneath another layer; not in printable space]. Once this led to every letter aappppeeaarriinngg ttwwiiccee..
* text in fonts where a != aa [leaving the text as a complicated substitution cypher; caused by font embedding for only the characters in the document]
* text in images
* no spaces: have to infer them from the gaps between letters (a sketch follows this list)
And these are generated by a whole host of different software with different assumptions, and you never know if there's something else you're missing.
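To make that last bullet concrete, here is a minimal sketch of space inference, assuming you already have each glyph's text and horizontal extent on a line; the 25%-of-font-size threshold is an arbitrary heuristic, not a value from any particular extractor:

    from dataclasses import dataclass

    @dataclass
    class Glyph:
        text: str
        x0: float  # left edge
        x1: float  # right edge

    def line_to_text(glyphs: list[Glyph], font_size: float) -> str:
        glyphs = sorted(glyphs, key=lambda g: g.x0)
        out = [glyphs[0].text]
        for prev, cur in zip(glyphs, glyphs[1:]):
            if cur.x0 - prev.x1 > 0.25 * font_size:  # wide gap: assume a word break
                out.append(" ")
            out.append(cur.text)
        return "".join(out)

    # "to be" laid out as four glyphs with a visible gap but no space character
    line = [Glyph("t", 10, 15), Glyph("o", 15, 21), Glyph("b", 27, 33), Glyph("e", 33, 38)]
    print(line_to_text(line, font_size=12))  # -> "to be"

Tuning that threshold is where it gets ugly: justified text, kerning, and superscripts all shift what counts as a "gap".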
1. When searching (Ctrl+F) a commonly used phrase that occurs multiple times in a PDF, some occurrences fail to show up because of line breaks, accidental hyphenation, etc.
2. Once in a while, I come across PDF files where searches for words containing "fi", "ff", etc. fail because of some insane ligature glyph replacements (a normalization sketch follows this list).
3. Some PDF files that have a two-column layout for text still treat lines across the two columns as one line. Search fails again.
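Point 2 at least has a mechanical workaround; this is a general Unicode trick, not specific to any PDF reader. Ligature code points carry compatibility decompositions, so normalizing both the extracted text and the query before matching makes the "fi"/"ff" cases line up again. It does nothing for the line-break or two-column problems:

    import unicodedata

    extracted = "e\ufb03cient work\ufb02ow"   # "efficient workflow" with ffi / fl ligature glyphs
    query = "efficient"

    print(query in extracted)                 # False: U+FB03 is not the letters "ffi"

    norm = lambda s: unicodedata.normalize("NFKC", s)
    print(norm(query) in norm(extracted))     # True once both sides are normalized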
Assuming you have a reasonable PDF file, you have to parse the entire page content stream, which includes things like colors, lines, Bezier curves, etc., extract the text-showing operations, and then stitch the letters back into words and the words back into reading order, as best you can.
Many sensible PDF producers encode letters and whitespaces reasonably thereby preserving reading order, but this is far from universal.
For an idea of what content stream parsing involves, this is how I currently do it: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad....
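That linked code is C#. As a hedged sketch of the same pipeline in Python, pdfminer.six will do the content-stream interpretation (paths, colors, text matrices) for you and hand back positioned characters, which still leaves the word and reading-order stitching as a heuristic exercise; the file name is a placeholder:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

    for page in extract_pages("example.pdf"):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip images, curves, and other non-text layout objects
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                chars = [obj for obj in line if isinstance(obj, LTChar)]
                if not chars:
                    continue
                # Each LTChar carries its glyph and bounding box; turning them into
                # words and ordering the lines sensibly is still on you.
                text = "".join(c.get_text() for c in chars)
                print(f"y={line.y0:.1f} x={line.x0:.1f}: {text}")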
It is a bit of a dog's dinner now, mostly because of backwards compatibility. XPS is better but obviously failed in the market.
But of course, with the HTML+CSS+JS stack being more a programming language than a document format, there are no bounds to how awful one can make it either.