> The PDF format is famously complex. With support for various media types, complicated font rendering and even rudimentary scripting, PDF readers are a common target for vulnerability researchers.
So still no chance in the foreseeable future for this monstrous "paper-based" mockery of docs in a digital age to get phased out?
Paper is still very much a thing in business and office work. PDF's near-perfect translation between computer monitor and paper makes it an absolutely critical piece of technological infrastructure.
I opened https://en.wikipedia.org/wiki/Open_XML_Paper_Specification and searched for "form field" and got no hits, so if nothing else the IRS couldn't use it. The Licensing section is filled with all kinds of nonsense, but I guess if it's an ECMA standard ... how bad can it be?
Copy-paste mostly works fine for me. I only have trouble when the file is generated in a weird way (e.g. scanned from a paper document then fed through OCR), or has complex formatting (e.g. math equations) that has no hope of working correctly in any system. In those cases, I don't see how it's the fault of the PDF format, any more than it's the fault of HTML (or whatever you think is a "real digital document" format) that it can embed a picture of a scanned document that totally breaks copy-pasting.
How is inserting random line breaks, which makes it impossible to copy and paste a simple paragraph as a paragraph instead of a bunch of lines, "fine"? This is very common for regular non-OCR PDFs; you don't need any math complexity.
(Math equations also have plenty of hope, complex as they are: you could copy and paste some kind of "LaTeX" representation, which is sometimes what was used to produce those PDFs in the first place.)
> whatever you think is a "real digital document" format
Whatever supports the basic digital interaction that alternative formats have offered for decades, or whatever doesn't have those rigid, paper-based layout limitations that keep you from reading a doc on one of your most popular digital devices, your phone, because the screen is smaller than a sheet of paper.
It's not fine when it happens, but the issue you describe is a property of the PDF viewing application more than the PDF file format (which supports semantic paragraph tags, for example). Adobe Acrobat Reader handles copy & paste well.
Of course Acrobat Reader doesn't handle it well, since it's an inherent design flaw of the format, despite your trying to deny the obvious. Just tried it: same issue, a paragraph of 3 lines is pasted as 3 lines.
> PDF file format (which supports semantic paragraph tags, for example).
These are called newlines and have pretty widespread support outside of some paper pockets of resistance! You only need some other semantic tags because the format fails at the basics.
(But don't look at the annual report. That marvel of a public disclosure document not only fails to copy and paste paragraphs, it also shows off another nice niche use of PDF: you get garbage chars instead of text. Rather ironic.)
I tried a few documents and got the same result (i.e. each line being treated as a separate paragraph), but I did find that the Fed FOMC meeting doc[1] actually worked properly, though only in Adobe Acrobat. It was still screwed up in pdf.js. So I guess the format itself technically supports it, but implementations rarely do it properly.
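To make the implementation side concrete: when a file carries no tagged structure, a viewer only sees positioned lines of text and has to guess where paragraphs begin. Below is a minimal sketch of one such guess, based on the vertical gap between baselines. The function name `reflow`, the `(baseline_y, text)` input shape, and the `line_height` and `gap_factor` defaults are all assumptions for illustration, not how Acrobat or pdf.js actually do it.

```python
from typing import List, Optional, Tuple


def reflow(lines: List[Tuple[float, str]],
           line_height: float = 12.0,
           gap_factor: float = 1.5) -> List[str]:
    """Group positioned text lines into paragraphs (hypothetical heuristic).

    `lines` holds (baseline_y, text) pairs in reading order, with y
    decreasing down the page. A vertical gap larger than
    gap_factor * line_height is treated as a paragraph break; anything
    smaller is treated as a soft wrap and joined with a space.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    prev_y: Optional[float] = None

    for y, text in lines:
        text = text.strip()
        if not text:
            continue  # ignore empty lines entirely
        if prev_y is not None and (prev_y - y) > gap_factor * line_height:
            # Large gap: close the running paragraph and start a new one.
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev_y = y

    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


page = [
    (700.0, "Title"),
    (670.0, "First line of"),
    (658.0, "the paragraph."),
    (630.0, "Second paragraph."),
]
print(reflow(page))
```

A heuristic like this breaks on tight layouts, multi-column pages, and hyphenated words, which is one plausible reason results differ so much between viewers.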
The first two work just fine in Adobe Acrobat Reader on iOS. The third is garbage, probably because the producer didn't include a ToUnicode map or equivalent.
The format supports a lot that is not commonly implemented by PDF readers (or PDF producers).
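A rough illustration of why a missing ToUnicode map produces garbage: the page content refers to glyphs by font-specific codes, and a subset font is free to number them arbitrarily. The sketch below is heavily simplified. A plain dict stands in for a real ToUnicode CMap (which is a stream of `bfchar`/`bfrange` entries), and the `extract_text` helper and the Latin-1 fallback are assumptions about what a naive extractor might do.

```python
from typing import Dict, Optional


def extract_text(codes: bytes, to_unicode: Optional[Dict[int, str]]) -> str:
    """Turn single-byte glyph codes into text (simplified, hypothetical).

    With a mapping, each code is looked up and unknown codes become
    U+FFFD. Without one, the extractor can only guess; here it guesses
    that the codes are Latin-1 characters.
    """
    if to_unicode is None:
        return codes.decode("latin-1")
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# A subset font that numbered its glyphs 1..4 in order of first use.
shown = bytes([1, 2, 3, 3, 4])
mapping = {1: "H", 2: "e", 3: "l", 4: "o"}

print(extract_text(shown, mapping))     # Hello
print(repr(extract_text(shown, None)))  # '\x01\x02\x03\x03\x04'
```

The rendered page looks identical in both cases, since the glyph outlines are the same; only text extraction is affected.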
And a good format wouldn't require any ToUnicode maps for simple text in the first place.
And supporting a lot of things poorly, without common implementations, isn't a defence against the charge of high complexity and bad design, but a reinforcement of it.
(Also, no, the first document doesn't work on iOS: I select the title and two paragraphs, copy, paste, and I get a single line instead of 3, so a different manifestation of the same common PDF failure.)