Hacker News

PDF is, without a doubt, one of the worst file formats ever produced and should really be destroyed with fire... That said, as long as you think of PDF as an image format, it's less soul-destroying to deal with.



PDF is good at what it's supposed to be good at. Parsing a PDF to extract data is like using a rock as a hammer and a screw as a nail: if you try hard enough it'll eventually work, but it was never intended to be used that way.


I think my fastener analogy would probably involve something more like trying to remove a screw that's been epoxied in. Or perhaps trying to do your own repairs on a Samsung phone.

It's not that the thing you're trying to do is stupid. It's probably entirely legitimate, and driven by a real need. It's just that the original designers of the thing you're trying to work on didn't give a damn about your ability to work on it.


Actually, parsing text data from a PDF is more like using the rock to unscrew a screw, in that it was not meant to be done that way at all. But yeah, PDF was designed to provide a fixed-format document that could be displayed or printed with the same output regardless of the device used.

I'm not sure (I haven't thought about it a lot) that you could come up with a format that duplicates that function and is also easier to parse or edit.


It's closer to using a screwdriver to screw in a rock. The task isn't supposed to be done in the first place but the tool is the least wrong one.


I would think any word processing document format would duplicate that function and be better.


It's pretty silly when you think about it. There's an underlying assumption that you'll work with the data in the original format that you used to make the PDF.


“PDF is good at what it's supposed to be good at.”

QFT. PDF should really have been called “Print Description Format”. At heart it’s really just a long list of non-linear drawing instructions for plotting font glyphs; a sort of cut-down PostScript.

https://en.wikipedia.org/wiki/PostScript

(And, yes, I have done automated text extraction on raw PDF, via Python’s pdfminer. Even with library support, it is super nasty and brittle, and very document specific. Makes DOCX/XLSX parsing seem a walk in the park.)
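To see why it's so brittle, here is roughly what the "text" in a PDF content stream looks like, with a deliberately naive extractor. This is a hypothetical, uncompressed stream and plain Python with no PDF library; real streams are usually Flate-compressed, use custom font encodings, and split runs into TJ kerning arrays, which is where the pain really starts.

```python
import re

# A toy, uncompressed PDF content stream. Real streams are usually
# Flate-compressed, and strings may be hex-encoded or use custom
# font encodings rather than anything resembling ASCII.
content = rb"""
BT
/F1 12 Tf          % select font F1 at 12pt
72 700 Td          % move the text cursor
(Hel) Tj           % show "Hel"
1 0 0 1 92 700 Tm  % reposition -- generators often split words like this
(lo, world) Tj
ET
"""

# Naive approach: grab every literal string that precedes a Tj operator.
# This throws away all positioning, so the output order is whatever order
# the generator happened to emit -- one big reason extraction is brittle.
chunks = re.findall(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj", content)
text = b"".join(chunks).decode("latin-1")
print(text)  # -> Hello, world
```

Note the regex only recovers readable text here because this stream happens to emit the glyph runs in reading order; nothing in the format guarantees that.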

What’s really annoying is that the PDF format is also extensible, which allows additional capabilities such as user-editable forms (AcroForms and XFA) and Accessibility support.

https://www.adobe.com/accessibility/pdf/pdf-accessibility-ov...

Accessibility makes text content available as honest-to-goodness actual text, which is precisely what you want when doing text extraction. What’s good for disabled humans is good for machines too; who knew?

I.e., the PDF format already offers the solution you seek. Yet you could probably count on the fingers of one hand the PDF generators that write Accessible PDF as standard.

(As for who’s to blame for that, I leave others to join up the dots.)


PDF is great at what it's meant to be: digital printed paper, with its pros (it will look exactly the same anywhere) and cons (you can't easily extract data from it or modify it).

Currently, there is no viable alternative if you want the pros but not the cons.


For me, the biggest con of PDFs is that like physical books, the font family and size cannot be changed. This means you can't blow the text up without having to scroll horizontally to read each line or change the font to one you prefer for whatever reason. It boggles my mind that we accept throwing away the raw underlying text that forms a PDF. PDF is one step above a JPEG containing the same contents.


> Currently, there is no viable alternative if you want the pros but not the cons

I remember OpenXPS being much easier to work with. That might be due to cultural rather than structural differences, mind - fewer applications generate OpenXPS, so there are fewer applications generating them in their own special snowflake ways.


This is the first time I've heard of it. When I search for it I only find the Wikipedia article and 99 links on how to convert it to PDF.

The problem with this is that from an average person perspective it doesn't have the pros. There is no built-in or first-party app that can open this format on Mac and Linux. More than 99% of the users only want to read or print it. It's hard to convince them to use an alternative format when it's way more difficult to do the only thing they want to do.


It's a Windows thing, since Windows 7, IIRC. It's OK now, but it was buggy for years, and since hardly anything consumes XPS files, however much better it may be, it's not more useful.


It was too late and probably too attached to Microsoft to succeed. It is still used as the spool file format for modern printer drivers on Windows.


Screenshots of Smalltalk. (I'm joking.)


We have to fill existing PDFs from a wide range of vendors and clients. Our approach is to rasterise all PDFs to 300 DPI PNG images before doing anything with them.

Once you have something as a PNG (or any other format you can get into a Bitmap), throwing it against something like System.Drawing in .NET(core) is trivial. Once you are in this domain, you can do literally anything you want with that PDF. Barcodes, images, sideways text, html, OpenGL-rendered scenes, etc. It's the least stressful way I can imagine dealing with PDFs. For final delivery, we recombine the images into a PDF that simply has these as scaled 1:1 to the document. No one can tell the difference between source and destination PDF unless they look at the file size on disk.

This approach is non-ideal if minimal document size is a concern and you can't deal with the PNG bloat compared to native PDF. It is also problematic if you would like to perform text extraction. We use this technique for documents that are ultimately printed, emailed to customers, or submitted to long-term storage systems (which currently get populated with scanned content anyways).
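For a sense of the bloat: at 300 DPI a single US Letter page is already a multi-megapixel image. A back-of-the-envelope calculation (plain arithmetic, assuming 8-bit RGB before PNG compression):

```python
dpi = 300
width_in, height_in = 8.5, 11.0   # US Letter page size in inches

width_px = int(width_in * dpi)    # 2550
height_px = int(height_in * dpi)  # 3300
raw_mb = width_px * height_px * 3 / 1e6  # uncompressed 8-bit RGB

print(width_px, height_px)              # 2550 3300
print(f"{raw_mb:.1f} MB raw per page")  # 25.2 MB raw per page
```

PNG compression shrinks that a lot for mostly-white pages, but it still dwarfs the few kilobytes of operators and fonts a native PDF page needs.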


You could probably reduce file size by generating your additions as a single PDF, and then combining that with the original 'form', using something like

pdftk form.pdf multibackground additions.pdf output output.pdf


> No one can tell the difference between source and destination PDF unless they look at the file size on disk.

Not even when they try to select and copy text?


You can add PDF tag commands to make rasterised text selectable and searchable, though they probably aren't doing that.
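That's the same trick OCR tools use: overlay the recognised words in text rendering mode 3 (invisible), so the text is selectable and searchable but never painted over the raster. A sketch of the content-stream fragment involved (PDF operators, not runnable code; the font name and coordinates here are made up):

```
BT                  % begin text object
/F1 10 Tf           % hypothetical font resource, 10pt
3 Tr                % text rendering mode 3: invisible
72 700 Td           % position over the rasterised word
(invoice) Tj        % show the OCR'd text, invisibly
ET                  % end text object
```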


Any recommended library for .NET to extract text by coordinates?


There's iText 7 (also for Java). Not sure how it compares with other libraries, but it will parse text along with coordinates. You just need to write your own extraction strategy to pull out what you want.

From my experience, it seems to grab text just fine, the tricky part is identifying & grabbing what you want, and ignoring what you don't want... (for reasons mentioned in the article)

https://github.com/itext/itext7-dotnet

https://itextpdf.com/en/resources/examples/itext-7/parsing-p...


I don't know that this could exist for all PDFs.

Sounds like you are in need of OCR if you want to be able to use arbitrary screen coords as a lookup constraint.


Lots of people doing their daily jobs are not aware of the information loss that occurs whenever they are saving/exporting as PDF.


In the consulting industry I’ve seen PDF being used precisely because third parties couldn’t mess with the content anymore.


Yes, the company I once worked for used to supply locked PDF copies to make it slightly harder for casual readers to re-use / steal our text.


That’s the approach I’m using to reformat (“reflow”) PDFs for mobile in my app https://readerview.app/


The first link on your demo gives me an error (mobile safari) https://www.appblit.com/pdfreflow/viewdoc?url=http://arxiv.o...


I have been waiting for this for so long. It really works, well done.


Tell that to the entire commercial print industry, where they work very well.


Yup. I still have PTSD from a project where I needed to extract text from millions of PDFs.


What alternative do you propose? Postscript?


Why not, .ps.gz works pretty well.


... and is much more difficult to extract text from than PDF, given that it's Turing-complete (hello, halting problem) and doesn't even restrict your output to a particular bounding box.


It was never meant to be a data storage format. It's for reading and printing.


Except it sucks for reading?


I haven't experienced problems reading articles and books in PDF format on my phone.


I read ebooks on my Nintendo DSi for several years when I was in college; the low-resolution screen combined with my need for glasses (and dislike of wearing them) made reading PDF files unbearable. Later on I got a cheap Android tablet and reading PDF files was easier, but it still required constant panning and zooming. Today I use a more modern device (2013 Nexus 7 or 2014 Nvidia Shield), and I still don't like PDF files. I usually open the PDF in Word if possible, save it in another format, then convert to EPUB with Calibre, and dump the other formats.

Epubs in comparison are easy, as all it takes is a single tap or button press to continue. When there's no DRM on the file (thanks HB, Baen) I read in FBReader with custom fonts, colors, and text size. It doesn't hurt any that the epub files I get are usually smaller than the PDF version of the same book.

Personally, I think the fact that Calibre's format converter has so many device templates for PDF conversion says a lot.


Try being visually impaired.


I'm not sure I follow. What am I missing?


You clearly haven't ever worked with MP3.



