> The PDF format is famously complex. With support for various media types, comp...

Dalewyn · 2024-05-20T13:36:18 1716212178

Paper is still very much a thing in business and office work. PDFs allowing a near perfect translation between computer monitor and paper is an absolutely critical piece of technological infrastructure.

pjmlp · 2024-05-20T13:07:34 1716210454

Sure, if you come up with something that covers all PS and PDF use cases.

niutech · 2024-05-20T15:27:49 1716218869

What about OpenXPS (ECMA-388)?

mdaniel · 2024-05-20T15:33:55 1716219235

I opened https://en.wikipedia.org/wiki/Open_XML_Paper_Specification and searched for "form field" and got no hits, so if nothing else the IRS couldn't use it. The Licensing section is filled with all kinds of nonsense, but I guess if it's an ECMA standard ... how bad can it be?

https://ecma-international.org/wp-content/uploads/TC46-XPS-W... and https://ecma-international.org/wp-content/uploads/TC46-XPS-W... are interesting in that they're different packaging of presumably the same data for compare-and-contrast. I will say that exploring .xps files is much easier via $(unzip) than using qpdf or friends
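The unzip point works because an .xps file is a ZIP-based package, so any ZIP library can list its parts. A minimal Python sketch follows; the helper name `list_package_parts` and the part names in the usage note are illustrative assumptions, not taken from the linked ECMA samples.

```python
import io
import zipfile


def list_package_parts(data: bytes) -> list[str]:
    """Return the sorted part names stored inside a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return sorted(package.namelist())


def build_demo_package() -> bytes:
    """Build a tiny in-memory ZIP standing in for a real .xps file (hypothetical contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("Documents/1/Pages/1.fpage", "<FixedPage/>")
    return buffer.getvalue()
```

For a file on disk, something like `list_package_parts(open("sample.xps", "rb").read())` would do the same job as `unzip -l`, where `sample.xps` is a placeholder path.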

rrr_oh_man · 2024-05-20T13:06:04 1716210364

What's the alternative?

jl6 · 2024-05-20T13:11:39 1716210699

PDF/A. All the good bits of PDF (compatibility, standardization, encapsulation), without the worst bits (media extensions, JavaScript).

(Except PDF/A-4, which reintroduces JavaScript for some horrific reason).

azeemba · 2024-05-20T13:13:55 1716210835

The problem here was neither media extensions nor embedded JavaScript though.

It was pdf.js's handling of fonts.

agumonkey · 2024-05-20T13:24:13 1716211453

That said, a smaller spec may help people focus on more solid code. Potentially.

bee_rider · 2024-05-20T16:22:25 1716222145

PDF/A requires fonts to be embedded rather than linked. Would that have saved the day?

jl6 · 2024-05-20T17:41:15 1716226875

No, /FontMatrix is part of a metadata object which can be present whether the font is embedded or external.
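Since the matrix arrives as document-supplied data either way, a reader has to treat it as untrusted. The thread does not describe the actual pdf.js fix; the sketch below only illustrates one general defensive pattern, and both `sanitize_font_matrix` and the fallback value are assumptions made for the example: accept the value only if it is exactly six finite numbers, otherwise fall back to a default.

```python
import math

# Assumed fallback: the common 1000-units-per-em scaling matrix.
DEFAULT_FONT_MATRIX = (0.001, 0.0, 0.0, 0.001, 0.0, 0.0)


def sanitize_font_matrix(value):
    """Return value as a tuple of six floats, or the default if it is malformed."""
    if not isinstance(value, (list, tuple)) or len(value) != 6:
        return DEFAULT_FONT_MATRIX
    cleaned = []
    for entry in value:
        # Reject strings, None, nested objects, and booleans outright.
        if isinstance(entry, bool) or not isinstance(entry, (int, float)):
            return DEFAULT_FONT_MATRIX
        try:
            number = float(entry)
        except OverflowError:
            return DEFAULT_FONT_MATRIX
        if not math.isfinite(number):
            return DEFAULT_FONT_MATRIX
        cleaned.append(number)
    return tuple(cleaned)
```

The point of the type check is that a non-numeric entry never reaches whatever code later consumes the matrix.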

eviks · 2024-05-20T13:37:36 1716212256

The worst bit of PDF since its inception (the worst since it covers the most common use case before media/JS) is that it's not a real digital document as in: simplistic digital things like selection&copy&paste are broken "by design"

gruez · 2024-05-20T14:32:31 1716215551

>is that it's not a real digital document as in: simplistic digital things like selection&copy&paste are broken "by design"

Copy paste mostly works fine for me. I only have trouble when it's generated in a weird way (e.g. scanned from a paper document then fed through OCR), or has complex formatting (e.g. math equations) that has no hope of working correctly in any system. In those cases, I don't see how it's the fault of the PDF format, any more than HTML (or whatever you think is a "real digital document" format) can embed a picture of a scanned document that totally breaks copy-pasting.

eviks · 2024-05-20T15:59:40 1716220780

How is inserting random line breaks, making it impossible to copy&paste a simple paragraph as a paragraph instead of a bunch of lines, "fine"??? This is very common for regular non-OCR PDFs; you don't need any math complexity.

(but also math equations have plenty of hope even though they're complex indeed, you can copy&paste some kind of "latex" representation that is sometimes used to ... produce those PDFs)

> whatever you think is a "real digital document" format

whatever supports basic digital interaction we've had available to use for many decades in alternative formats, or whatever doesn't have those rigid pre-digital-paper-based layout limitations where you can't use one of your most popular digital devices - your phone - to read a doc since the phone is smaller than a sheet of paper
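On the line-break complaint: when a viewer pastes each visual line separately, the usual workaround is to rejoin lines after the fact. The sketch below is a heuristic under one stated assumption, namely that paragraphs in the pasted text are still separated by blank lines. The function name `reflow` is made up for the example, and it does nothing for text where that assumption fails.

```python
import re


def reflow(pasted: str) -> str:
    """Join the lines inside each blank-line-separated block into one paragraph."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", pasted):
        # Drop empty lines and surrounding whitespace inside the block.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return "\n\n".join(paragraphs)
```

This treats the symptom only: words hyphenated across lines, multi-column layouts, and paragraphs without blank lines between them would all need extra handling.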

jl6 · 2024-05-20T17:46:53 1716227213

It's not fine when it happens, but the issue you describe is a property of the PDF viewing application more than the PDF file format (which supports semantic paragraph tags, for example). Adobe Acrobat Reader handles copy & paste well.

eviks · 2024-05-20T18:37:32 1716230252

Of course Acrobat Reader doesn't handle it well since it's an inherent design flaw of the format despite your trying to deny the obvious. Just tried it - same issue, a paragraph of 3 lines is pasted as 3 lines

> PDF file format (which supports semantic paragraph tags, for example).

These are called newlines and have a pretty widespread support outside of some paper pockets of resistance! You only need some other semantic tags because the format fails at basics

jl6 · 2024-05-20T18:57:48 1716231468

Example PDF? Because I tried it too and it worked. Does your PDF use tags?

eviks · 2024-05-20T20:07:56 1716235676

Any PDF from a generic google search?

Here is one from Adobe https://www.adobe.com/support/products/enterprise/knowledgec...

Or even better: their annual investor docs a team of professionals has spent time carefully preparing...

like this https://www.adobe.com/pdf-page.html?pdfTarget=aHR0cHM6Ly93d3...

(but don't look at the annual report, that marvel of a public disclosure document not only doesn't copy&paste paragraphs, but has another nice niche use of PDF - you get garbage chars instead of text, rather ironic)

https://www.adobe.com/pdf-page.html?pdfTarget=aHR0cHM6Ly93d3...

gruez · 2024-05-21T03:47:04 1716263224

I tried a few documents and got the same result (i.e. each line being treated as a separate paragraph), but found that the Fed FOMC meeting doc[1] actually worked properly, though only in Adobe Acrobat. It was still screwed up in pdf.js. So I guess the format itself technically supports it, but implementations rarely do it properly.

[1] https://www.federalreserve.gov/mediacenter/files/FOMCprescon...

jl6 · 2024-05-21T08:09:38 1716278978

The first two work just fine in Adobe Acrobat Reader on iOS. The third is garbage, probably because the producer didn't include a ToUnicode map or equivalent.
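To make the ToUnicode point concrete: text in a PDF is stored as font-specific character codes, and a ToUnicode CMap tells the reader which Unicode text each code stands for. The sketch below parses only simple `beginbfchar` sections with even-length hex strings; the function names and the sample CMap text in the usage note are assumptions for illustration, and real CMaps also use `bfrange` sections that this ignores.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map character codes to Unicode strings from bfchar entries."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for source, target in HEX_PAIR.findall(block):
            # Target values are UTF-16BE encoded text.
            mapping[int(source, 16)] = bytes.fromhex(target).decode("utf-16-be")
    return mapping


def decode_codes(codes: list[int], mapping: dict[int, str]) -> str:
    """Translate character codes to text, marking unmapped codes with U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)
```

With a made-up CMap such as `"2 beginbfchar\n<01> <0048>\n<02> <0069>\nendbfchar"`, codes 1 and 2 decode to "Hi". With no mapping at all, every code falls through to the replacement character, which is roughly the garbage a copy & paste produces.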

The format supports a lot that is not commonly implemented by PDF readers (or PDF producers).

eviks · 2024-05-21T08:34:41 1716280481

How does this help me on Windows?

And a good format wouldn't require any ToUnicode maps for simple text in the first place

And poorly supporting a lot without common implementations isn't a defence against the charge of high complexity and bad design, but a reinforcement thereof

(also, no, the first document doesn't work on iOS, I select title and two paragraphs, copy, paste, and I get a single line instead of 3, so a different manifestation of the same common fail of PDFs)

jl6 · 2024-05-26T07:12:41 1716707561

“Here’s a nickel kid, get yourself a better OS”?

Still, the fact that some PDF processors can make this work shows that the format isn’t broken “by design”.

niutech · 2024-05-20T15:22:43 1716218563

OpenXPS (https://en.wikipedia.org/wiki/Open_XML_Paper_Specification), DjVu (https://en.wikipedia.org/wiki/DjVu).

jandrese · 2024-05-20T13:11:38 1716210698

EPS, but it is not much better.

sneed_chucker · 2024-05-20T13:21:24 1716211284

https://xkcd.com/927/