
PDF seems to fit a world-view where the highest objective of a document is to be printed on paper.

From a machine-parsing perspective PDF files are a nightmare. Chunks of text may be broken anywhere, mid-sentence, mid-word. These chunks may appear in the document in any order.

Spaces may be encoded as spaces, or they may be created in a number of other ways, like by positioning chunks or by setting character spacing per character.
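For a concrete (made-up) illustration, a text-showing fragment of a PDF content stream might look like this; the numbers inside the TJ array adjust horizontal position in thousandths of text space, so the gap reads as a space on screen even though no space character is stored:

```
BT                                    % begin text object
/F1 12 Tf                             % select font F1 at 12 points
72 700 Td                             % move the text cursor
[(Hel) 2 (lo) -278 (Wor) 1 (ld)] TJ  % "Hello World", split into chunks
ET                                    % end text object
```

An extractor that only collects the string operands sees "HelloWorld" and has to infer the space from the -278 adjustment.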

The mapping from code point to glyph need not be pure Unicode; a PDF document may contain a custom font with additional glyphs.
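When a document does carry the mapping back to Unicode, it lives in an optional ToUnicode CMap attached to the font. A minimal, hypothetical fragment mapping two glyph codes looks roughly like:

```
2 beginbfchar
<0001> <0048>    % glyph code 0001 -> U+0048 "H"
<0002> <0065>    % glyph code 0002 -> U+0065 "e"
endbfchar
```

If the CMap is absent, which is perfectly legal, the extractor is left guessing what the glyph codes mean.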

This is all stuff I learned by trying to parse a limited set of PDFs found in the wild.

All of these gotchas are, by the way, completely PDF/A compliant.




An alternative thesis - PDFs fit a world view where the highest objective of a document is to be read by a human.

If I am given something that I am personally expected to read thoroughly - be it a report, long-form article, slide deck, etc - then the most professional format by far is a LaTeX PDF.

I can't claim anything beyond personal experience, but if I want to signal to someone that a document is important and was written with care, then they are getting a PDF.


The thing is, human consumption tends to rely on machine consumption. We want search engines to index our documents, and we want to be able to search within the documents. These features rely on machine parsability.

It is perfectly possible to generate a PDF file with none of the issues mentioned; the problem is that most people don't have the required control over their toolchain, and a lot of tools will create such issues by default.


A LaTeX-generated PDF along with the .tex file used to generate it solves all the problems mentioned by the parent. Now to convince casual users that LaTeX is worth learning... that's a completely different problem.


Not true. I've seen the most abhorrent PDFs generated by LaTeX in academia. When I was working in the digitization department of a public university library, we realized that we needed to handle PDFs just like every other scanned page: rasterize, then OCR.


That doesn't make sense. In the case of a PDF+TeX bundle you just run the TeX through Pandoc and you have a neat result. Why would you OCR the PDF when you can just use the raw markup?


Because a) almost no LaTeX document is published as PDF+TeX (it's PDF or nothing), and b) LaTeX has a bazillion Turing-complete extensions that don't make sense semantically until you render them.


Yet for the visually impaired, PDFs are a nightmare, for the reasons cited above.


So PDFs are the Mercedes Benz of document formats?


I worked on a bot that parsed NDAs using (among other things) flags, regexes, and ML to tell you whether you could sign them. Of all the file formats that had to be parsed (doc, docx, txt, rtf, and pdf), PDF was the most troublesome. When parsing a PDF, nothing was 100% certain.

In the end it was one of the most interesting projects I've worked on in my (short) career, but sometimes it sucked.


>PDF seems to fit a world-view where the highest objective of a document is to be printed on paper.

I've seen this in academia as well. And, inspired by PoC||GTFO, I've been thinking about downloading academic PDFs, writing a web server that provides an interactive model of the topic in the paper, patching it into the PDF to turn it into a polyglot, and then re-uploading the PDFs. This way people who want their PDFs can have it, those who want something a little more modern can have that simply by interpreting the exact same file as a bash script, and I get to understand the paper by modeling it.


> highest objective of a document is to be printed on paper.

In my line of work, I must lay out information and facts as if they were on paper, in order, to create a specific narrative.

This narrative cannot be lost in hyperlinks or other web-specific mechanisms. Cases must be laid out in a very specific order from beginning to end to make my argument as to why things were done the way they were.

No other medium fits this except paper, or PDF, in a digital sense.

Edit: On websites, content should be replicated in an appropriate format, but most certainly referenced to its original. And the original should be readily available.


Can you clarify what your line of work is? As it stands I'm unclear on why an HTML document can't represent "cases ... laid out in a very specific order from beginning to end". You don't need to use links or other functionality just because it's there.


Healthcare. Too many documents from many different sources in varying formats, too much time to compile electronically.

It’s such a mess.


>From a machine-parsing perspective PDF files are a nightmare.

Can someone with experience explain why? Does it have to do with each letter having an absolute position in the document? I have no clue, to be honest.


You essentially have to render the document yourself in order to figure out what the order of the chunks is. Then you might be able to extract content from the chunks you're interested in, or not: a given compressed chunk might be literally anything.

Didier Stevens is the best PDF expert I know of: https://blog.didierstevens.com/programs/pdf-tools/


There's a whole pile of different gotchas.

* identifying characters that won't actually print [white text; zero-size font; beneath another layer; not in printable space]. Once this led to every letter aappppeeaarriinngg ttwwiiccee..

* text in subset fonts where the code for 'a' doesn't actually mean 'a' [leaving the text as a complicated substitution cipher; caused by embedding a font containing only the characters used in the document]

* text in images

* no spaces: you have to infer them from the gaps between letters

And these are generated by a whole host of different software with different assumptions, and you never know if there's something else you're missing.
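The "no spaces" gotcha above can be sketched with a toy heuristic: treat a horizontal gap between glyphs that exceeds some threshold as a space. The function name, glyph tuples, and coordinates below are all made up for illustration; real extractors derive the threshold from font metrics.

```python
def infer_text(glyphs, gap_threshold=1.0):
    """Rebuild a string from positioned glyphs, inserting spaces at gaps.

    glyphs: list of (character, x_position, advance_width) tuples.
    """
    parts = []
    prev_end = None
    for ch, x, width in glyphs:
        # A gap wider than the threshold is assumed to be a space.
        if prev_end is not None and x - prev_end > gap_threshold:
            parts.append(" ")
        parts.append(ch)
        prev_end = x + width
    return "".join(parts)

# "to be" laid out with no space character, only a positional gap:
glyphs = [("t", 0.0, 5.0), ("o", 5.0, 5.0), ("b", 13.0, 5.0), ("e", 18.0, 5.0)]
print(infer_text(glyphs))  # to be
```

Tighten the threshold and kerning starts splitting words; loosen it and real spaces disappear, which is exactly why extraction is never 100% certain.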


I have no experience in handling raw PDF data, but as a user, I sometimes notice that the computer is not reading the PDF text the same way as I'm reading it. Here are a few examples:

1. When searching (Ctrl+F) a commonly used phrase that occurs multiple times in a PDF, some occurrences fail to show up because of line breaks, accidental hyphenation, etc.

2. Once in a while, I come across PDF files where searches for words containing "fi", "ff", etc. fail because of some insane ligature glyph replacements.

3. Some PDF files that have a two-column layout for text still treat lines across the two columns as one line. Search fails again.
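The ligature failure in example 2 happens because the file stores a single glyph like U+FB03 (ffi ligature) instead of the letters f-f-i. A viewer or indexer can often recover the plain letters with Unicode compatibility normalization; a minimal sketch:

```python
import unicodedata

# "efficient" as extracted from a PDF that substituted the "ffi" ligature
extracted = "e\ufb03cient"

# NFKC decomposes compatibility characters such as U+FB03 back into "ffi"
normalized = unicodedata.normalize("NFKC", extracted)
print(normalized)  # efficient
```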


Yeah, pretty much exactly what you said. Since PDF is focused on presentation rather than content, it can write text content in any order, and the rules for converting byte values to Unicode values are extremely complex, supporting many different font formats. Some fonts (Type 3) don't even include mappings to Unicode in some scenarios, instead encoding only the appearance of glyphs and not their meaning.

Assuming you have a reasonable PDF file, you have to parse the entire page content stream, which includes things like colors, lines, Bézier curves, etc., extract the text-showing operations, and then stitch the letters back into words and the words back into reading order, as best you can.

Many sensible PDF producers encode letters and whitespaces reasonably thereby preserving reading order, but this is far from universal.

For an idea of what content stream parsing involves, this is how I currently do it: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad....


It was never really designed for that, only for display and printing. There are some features that let you mark up the text for easier searching and selection, but not every producer uses them.

It is a bit of a dog's dinner now, mostly because of backwards compatibility. XPS is better but obviously failed in the market.


Sure... but HTML is not easier, if not harder. Can't be bothered to get printing right with it either. I will take PDF any given day.


It is easier to generate a good document with HTML; you just have to leave out the bells and whistles.

But of course, with the HTML+CSS+JS stack being more a programming language than a document format, there are no bounds to how awful one can make it either.



