What's so hard about PDF text extraction? (filingdb.com)
406 points by fagnerbrack on Sept 14, 2020 | 235 comments

PDFs are the bane of my existence as someone who relies on machine translation every day. The worst is that so many event flyers and things, even important local government information, will just be dumped online as a PDF without there being any other effort to make the contents available. I don't know how blind people are supposed to participate in civil life here.

The problem is accessibility features are totally invisible to normal users. Someone with good intentions creates a pdf, and it works for them. They don't use screen reader tools or know how they work so they don't even realize there is a problem.

And the problem with that is that the screen reader tools are $1200, because of the huge associated R&D costs and incredibly small target market.

The sad thing is that the very complexity required to implement a screen reader is solely because of the technical nightmare information accessibility currently is.

It's reasonable to think that "if only" everything could be made ubiquitous and everyone (= developers) could become collectively aware of these accessibility considerations, maybe there would be a shift towards greater semanticity.

Thing is, though, that NVDA is open source, and iOS has put a built-in free screen-reader in the hands of every iPhone user... and not much has changed.

So it's basically one of those technologies forever stuck in the "initial surge of moonshot R&D to break new ground and realign status quo" phase. :(

How much good would moonshot-level R&D even be capable of doing? Without realigning the world around a new portable format for printable documents, isn't this 99% Adobe's problem to solve? Or are the hooks there to make much more accessible PDFs, and the issue is that various popular generators of those documents (especially WYSIWYG ones like MS Word) either don't populate them, or perhaps don't even have the needed semantic data in the first place?

For my part, I would love to see PDFs which can seamlessly be viewed in continuous, unpaged mode (for example, for better consumption on a small-screen device like a phone or e-reader). Even just the minimal effort required to tag page-specific header/footer type data could make a big difference here, and I expect that type of semantic information would be useful for a screen reader also.

I thought flowed text in PDFs was possible, but rarely used because it removes half of the benefit of PDFs (that is, page-exact references and rendering).

There already is a solution, based on html and very easy to deal with: epub. Publishers are gradually shifting to it, even for things like science textbooks, as they realize that PDFs might look pretty but they're a usability nightmare. There are still some stupid things publishers do with epubs, but they'll get it right eventually; meanwhile the main text and font settings can be overridden by ereaders.

It's been a while since I've looked at the spec but I don't remember anything like that.

> Or are the hooks there to make much more accessible PDFs, and the issue is that various popular generators of those documents (especially WYSIWYG ones like MS Word) either don't populate them, or perhaps don't even have the needed semantic data in the first place?

Tagged PDF is a thing, and MS Word supports it. The problem is the very long tail of programs that generate PDFs and don't support tagged PDF. Even some widely used PDF generators, like pdflatex, don't generate tagged PDF, at least not by default.

In general, even trying to copy-paste from a pdflatex document is a nightmare.

Could governments insist that PDF software they buy be screen-reader friendly? If this were rigorously done, you'd have all government documents be readable by default, and then anyone else who ran the same software commercially would, too.

You could also impose requirements on public companies to provide corporate documents in accessible formats- these sorts of documents are already regulated.

There's various levers that could be pulled, maybe those aren't the right ones. But you could do it.

To do that you would have to identify a "pdf - the accessible parts" of the spec, or perhaps "pdf - possible accessible layouts" and the implementer would have to stick to that. This might come into conflict with government regulations regarding how particular pdfs should be laid out - that is to say if there was a layout that would break natural screen reading flow it would be inaccessible and thus not allowed by Law 2, but be required by Law 1.

This becomes difficult when writing Law 2 because probably you don't know all variations that can be required by Law 1 (where Law 1 is actually a long list of laws and regulations)

Depending on the legal system of a particular country writing Law 2 might not be actually feasible unless you know what Law 1 entails.

Why not just mandate that the content must be accessible to the blind/deaf/etc and let the implementors figure out how best to make that true? For example some municipalities might just choose to provide alternate formats in addition to PDF and that might be fine for them.

as per my original post, you might not be able to mandate something like that given your legal system.

If I have a law saying that a layout needs to look like X layout, and X layout is not achievable if it also needs to be accessible, then depending on the type of legal system you are in you can say Law Y supersedes all previous laws and requires you to make all PDFs accessible. If a legally mandated layout cannot be made accessible, then the closest similar layout that still meets accessibility requirements should be used instead (long descriptions of how closest is determined in legal format would of course follow).

I think this will work in the common law system, but I don't think it would work in a Napoleonic system of law (could be wrong, just think you would need to specify exactly what laws it superseded)

As an example when I was part of the efaktura project for the Danish Government https://en.wikipedia.org/wiki/OIOXML when the law was published mandating that all invoices to the government be sent as oioxml efaktura we ran into the problem that there was a previous law requiring that all telecommunications invoices (or maybe utility invoices, I forget) be sent with information that there was no provision for in the UBL standard that efaktura was based on.

Luckily I had introduced an extension method that we could use to put the extra data in with, but otherwise we would have had two competing laws.

As far as mandating layouts in government PDFs, you normally see that kind of thing in military documents, but laws and bureaucracies are such that there might be any number of rules and regulations that cannot be overwritten in a system by drafting an overarching accessibility bill.

Theoretically this would be covered under federal law, but it needs a lawsuit.

The problem is the PDF spec itself is not screen reader friendly.

The spec isn't, but you can make accessible PDFs. https://www.adobe.com/accessibility/pdf/pdf-accessibility-ov...

Accessible PDFs are also more easily machine readable.

It most certainly is!

ISO 32000-1 (PDF 1.7) and ISO 32000-2 (PDF 2.0) are both compliant with PDF/UA (ISO 14289) - the standard for PDF accessibility

PDF is mostly just a wrapper around Postscript, isn't it?

You could just put the original text in comments or something, wrapped in more tags to say what it is.

PDF is a document format, Postscript is a Turing complete programming language (and rather a fun one IMHO).

No, the ideas are the same, but it follows another implementation concept.

I'm not sure there isn't a middle solution? What about a screenreader emulator? I go to a website and it offers me the ability to upload a pdf, link to a website and then shows it to me like a screenreader would.

There may already be good tools that do this, but until it's super easy to see and "everyone" knows about it, then people won't think, "better just give it a quick look in the screenreader". Obviously a next good step to more adoption, is clear feedback about how to easily fix whatever issues the person is seeing...

This is one of the biggest reasons why I think that HTML is still important for general application development.

HTML isn't perfect, it could be a lot better. But it goes a long way towards forcing developers to put their interface in pure text, and to do the visual layout afterwards.

I think separating content from styling has gone a long way towards improving accessibility, precisely because it makes the accessibility features less invisible. I suspect there are additional gains we could make there, and other ways that HTML could force visual interfaces to be built on top of accessible ones.

> The worst is that so many event flyers and things, even important local government information will just be dumped online as a pdf

Well, they could have uploaded it to some online service that shows the document in the browser as an undownloadable but browsable thingy (I don't even know what to call that, and I am definitely not going to give any publicity to the online service by spelling out the name).

One certain online service (the one that lets you look at advertisements and fake news while you talk to your Aunt) is very popular for sharing events. On mobile they go out of their way to disable the ability to copy text. Even with the "copy" app set as my assist app, it seems to block or muck up the scraping of text from the screen. I have to go to mbasic.thissite.com (which doesn't include all the same content) to get things in plain text. It's a real barrier to my participation in society when there are characters in the text I can't read, and I just want to copy them to a translate app.

It also made me late for a Teams based interview, as I foolishly tried to pass the join link between phones using this webpage, and couldn't copy it on the other end.

I pretty much only access said website in Firefox with the extension that puts it in a jail. The mobile app is useless for anything but posting photos of the kids for the grandparents and aunts to see.

Same app implements seriously dark patterns to get users to install their messenger app. Said chat works in the desktop web browser but not a mobile browser. I shall never install it on my phone. Mobile presents as if you have pending messages even when you don't (inbox empty on the web version).

You can use mbasic.thissite.com and it lets you use chat features from mobile browsers.

On android or at least pixels, go in to the app switcher mode and with the app still on screen, copy the text. This uses OCR to copy and you can even copy text from screenshots.

When I read the first half of your comment, I thought you were cleverly describing and would pivot at the end to suggest HTML and web servers.

I hate it too. At work we have a solution that uses PDF text extraction software and the result is sometimes not great. This in turn breaks the feature I’m owning, making it hard to work reliably. Of course users aren’t aware of that, so I’m the one taking the blame :/

There's probably a different format, facebook!

everything is on facebook instead of the real internet, personal pet peeve...

Is there a good library for decoding the object tree in a PDF document?

Depends on your programming language. xpdf does a pretty good job from what I’ve heard.

Pdfplumber for Python

Here where I work we are parsing PDFs with https://github.com/flexpaper/pdf2json. It works very well, and returns an array of {x, y, font, text}.
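For anyone wanting to turn that kind of output into readable lines, here is a minimal Python sketch. It assumes the items arrive as a list of dicts with the x, y, font and text keys mentioned above, and that y grows downward. The group_into_lines name, the y_tolerance value and the sample data are made up for illustration, and the real pdf2json output may well be shaped differently.

```python
from typing import Dict, List


def group_into_lines(items: List[Dict], y_tolerance: float = 2.0) -> List[str]:
    """Group positioned text fragments into lines, top to bottom, left to right.

    Assumes fragments on the same visual line share roughly the same y
    (within y_tolerance). That is an assumption for this sketch, not a
    documented property of any particular extractor.
    """
    lines: List[List[Dict]] = []
    for item in sorted(items, key=lambda i: (i["y"], i["x"])):
        # Compare against the first fragment of the current line; start a
        # new line when this fragment sits too far below it.
        if lines and abs(item["y"] - lines[-1][0]["y"]) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Within each line, order fragments by x and join them with spaces.
    return [
        " ".join(frag["text"] for frag in sorted(line, key=lambda i: i["x"]))
        for line in lines
    ]


# Fragments deliberately listed out of reading order.
sample = [
    {"x": 120, "y": 50.5, "font": 1, "text": "world"},
    {"x": 10, "y": 50, "font": 1, "text": "Hello"},
    {"x": 10, "y": 70, "font": 1, "text": "Second line"},
]
result = group_into_lines(sample)
```

This does nothing about multi-column layouts: two columns at the same height would still be merged into one line, which is exactly the failure people describe elsewhere in this thread.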

If you are familiar with Docker, here is how you can add it to your Dockerfile.

  ARG PDF2JSON_VERSION=0.71
  RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
      && cd $HOME/pdf2json-$PDF2JSON_VERSION \
      && wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF... \
      && tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
      && ./configure > /dev/null 2>&1 \
      && make > /dev/null 2>&1 \
      && make install > /dev/null \
      && rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
      && cd

Your command is cut off

Maybe it was scraped off of a PDF? ;P

In all seriousness, though: it doesn't look like it was cut off; I think the final cd is just to return the docker builder thingee to the home directory? The file was already built and installed by that point.

A lot of people here seem to knock PDF, but I love it. Anyone who has tried to use OpenOffice full time probably does too. We have 'descriptive text' in various MS formats, or even html/css. The problem is every implementer does things in slightly different ways. So my beautiful OpenOffice resume renders with odd spacing and pages with 1 line of text in MS Office. With PDF, everyone sees the same thing.

> A lot of people here seem to knock PDF, but I love it.

People with disabilities who rely on screen-readers don't love it. There is no such problem with HTML/CSS, which should be the norm for internet documents.

> With PDF, everyone sees the same thing.

Yes, provided you can see in the first place...

> There is no such a problem with HTML/CSS which should be the norm for internet documents.

It should be. But meanwhile everybody seems to think it is perfectly ok that there is a bunch of JavaScript that needs to run before the document will display any text at all and how that text makes it into the document is anybody's guess.

It was an amazing shift in priorities that I feel like I somehow missed the discussion for. We went from being worried about hiding content with CSS to sending nothing but script tags in the document body within 5 years or so. The only concern we had when making the change seemed to be "but can Google read it?". When the answer to that became "Uh maybe" we jumped the shark.

My bashful take is that nobody told the rest of the web development world that they aren't Facebook, and they don't need Facebook like technology. So everyone is serving React apps hosted on AWS microservices filled in by GraphQL requests in order to render you a blog article.

I am being hyperbolic of course, but I was taken completely off guard by how quickly we ditched years of best practices in favour of a few JS UI libraries.

This complaint can be applied to paper too. PDF is not much more than a precise document to be printed or exactly visually presented.

Is PDF not supposed to improve on paper...? I'm rather surprised at the revelation that PDF is not accessible.

PDF is tied to page layout. PDF is a way to digitally describe something that's intended to be printed to paper, on a sheet of a certain size.

And as a format, it's much more sane than, say, Word or Excel.

Even if the focus on "where do I put this glyph" means the original text isn't in there by default.

Yea, but that's a terrible explanation for lack of basic accessibility in 2020. Literally just laziness.

FWIW this is not a technical barrier; it would be absolutely trivial to associate blocks of non-flowed text with the laid-out text.

My use is exclusively "this will end up on paper in a minute", so any improvement would be irrelevant.

Why would I want a document type that can't even reflow to the display size to represent any longer written text that is supposed to be consumed on a digital device?

I would ask a blind person.

Adobe Reader has an option to Read Aloud PDF files. I don't know how well or poorly that works, but I'm writing this comment just in case you were not aware of that function.

The problem is that you can't parse PDFs reliably. Half of the time they are just bitmap images from a scanner.

Some Word documents are like that too! Can't blame the PDF format if the source material is a bunch of scans.

html/css has a similar problem. A lot of my email is just a collection of images that contain text.

I absolutely hate those, but the HTML spec does require the alt attribute which is actually used in practice quite commonly.

I've never read the HTML spec directly until checking it now to verify your comment. I usually use MDN, which says that the alt attribute is not mandatory. [1]

[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Im...

Non-conforming email should be filtered by the scam filter before it ever gets to your MUA.

What is that email not conforming to?

PDF is digital paper.

You like it for that very quality: immutable, reproducible rendering.

Those who have to extract data from PDFs face nearly the same problem as those who have to deal with paper scans: no reliable structure in the data, the source of truth is the optical recognition, by human or by machine.

I agree, but my argument here is that it's up to the producer. They obviously wanted it to be a digital paper, for some reason, and not a data mining source. We should blame the producers, not the format. It's equivalent to saying it's hard to get the source code from the pesky exe files people distribute, so exe is a mess.

PDF is great for what it was originally designed for: a portable format for instructing printers on how to print a document. The problem is people using it in ways it wasn't designed for. Sharing a PDF of your document is about as useful as sharing an SVG export of your document (actually, an SVG probably has more semantic information). It is a vector image format, not a document format.

> They obviously wanted it to be a digital paper, for some reason, and not a data mining source.

"Because it's pretty" is it. 99% of people don't care about text being a data mining source.

The original goal of PDF was to create documents that could "view and print anywhere" (literally the original tagline of the Acrobat project), substantially the same as how the document creator intended them. What Adobe was trying to solve was the problem of sending someone a document that looked a particular way and when they rendered it on their printer or display, it looked different, e.g. having a different number of pages because subtle font differences caused word-wrapping to change the number of lines and thus the page flow. It wasn't about it being "pretty," it was about having functional differences due to local rendering and font availability. In this regard, the format is an emphatic success.

I do wish they had focused a bit more on non-visual aspects such as screen-reader data, but to say the whole point is "because it's pretty" is a bit uncharitable. The format doesn't solve the problem you wish it solved, but it does solve a problem other than making things "pretty."

Alternatively, "the journal only accepts LaTeX."

I quite like PDFs, but this thread has been an eye-opener.

> Anyone who has tried to use OpenOffice full time probably does too.

I agree, I do too (LibreOffice), but for the opposite reason. Even internally, the font rendering in LibreOffice with many fonts is often quite bad. This is especially noticeable for text inside graphs in Calc.

If I'm going to read something lengthy that's a LibreOffice document, I open it (in LibreOffice), and export it to a PDF. LibreOffice consistently exports beautiful PDFs (and SVG graphs), which tells me that it "knows" internally how to correctly render fonts, just that its actual renderer is quite bad.

Is the renderer dealing with the classic small text glyph hinting problem?

Could be. I'm not sure what the issue is. Firefox, Chrome, and basically every other thing I use works fine.

The audience here is developers and other geeks who get stuck dealing with PDFs. The issue when you read into it is usually about structured data delivered via PDF — which I would wholeheartedly agree is a monstrous and unnecessary misuse of the format.

The other thing that is unfair is assholes who deliver tabular data in PDF format usually don’t want you to have it. When your county clerk prints a report, photocopies it 30 times, crumples it and scans to PDF without OCR, that’s not a file format issue.

Yes, thank you! I have exactly the same feelings, because I like to write in old versions of iWork. With a PDF, I know that whatever I export will look the same for whoever I send it to.

I sometimes see people complain about how PDF sucks because it doesn't look quite the same everywhere (namely, non-Adobe readers), but if you're not doing anything fancy it pretty much does. It is, at minimum, more reliable than any other "open" format I'm aware of, save actual images.

The problem you have is that it's likely, increasingly likely, that your CV is the exact document that will next be 'read' by a computer rather than a human.

I know something about that area. Today, perhaps a 10th of CVs are sorted and prescreened by software. That fraction will only increase.

As someone who actually used to program in PostScript, I am happy as a clam with PDFs!

There are two issues with parsing them however.

  1) PDF is an output format and was never intended to have the display text be parseable.
  2) PDF is PostScript++, which means that it is a programming language.
     This means that a PDF is also an input description to the output that we
     are all familiar with seeing on a page.

PS I don't know if it is the case anymore, but Macs used to have a display server that handled all screen images in PDF format. That was an optimization from the NeXT display server, which displayed using Display PostScript.

>PDF is PostScript++, which means that it is a programming language.

The big change that came with PDF was removing the programming capabilities. A PDF file is like an unrolled version of the same PostScript file. There is still a residue of PostScript left but in no way can it be described as a programming language.

PDF is absolutely a programming language. It is not a general purpose programming language but a page description language. You are referring to looping constructs and procedures being removed, but a loop does not a language make. Similarly, LaTeX and sed are programming languages.

What features do you think make it a programming language? Because I have spent quite a bit of time working with it and all I can see is a file format.

"A programming language is a vocabulary and set of grammatical rules for instructing a computer or computing device to perform specific tasks."

A programming language is not inherently a programming language due to features it contains but due to it being used to program. A program is "a series of coded software instructions to control the operation of a computer or other machine."

In this way, a PDF file embodies a program that performs specific tasks. A PDF file does not contain a general purpose programming language, but it does contain the page description language of the output format that describes what is to be imaged. Then, the PDF program is given to an interpreter that displays the output.

This is the same as a simple program in turtle graphics to display a rectangle, even if no other language feature was used. In such a case, one would say that rectangle was programmed. We would not use the word program in connection with that turtle graphics program, if the rectangle description were not sent to an interpreter that displayed the rectangle.

By that definition notepad is a programming language because it can open and show a text document.

> PS I don't know if it is the case anymore, but Macs used to have a display server that handled all screen images in PDF format. That was an optimization from the NeXT display server, which displayed using Display PostScript.

Quartz! https://en.wikipedia.org/wiki/Quartz_(graphics_layer)#Use_of...

AHA! Thank you.

Ugh, yeah I use LibreOffice for all my internal stuff but I have to keep MS Office installed for editing externally-visible documents, so I can be (somewhat) sure the formatting isn't going to get screwed up.

PDF is very good for what it was designed to do, which is to represent pages for printing. It is not so good for use cases where parsing the text is more important than preserving layout.

Indeed, I often say there's a special place in Hell where there are programmers trying to extract data from PDFs.

The souls who labour there, in life, posted PDFs to websites when HTML would have sufficed.

> With PDF, everyone sees the same thing.

SVG does that too, but it can also have aria tags to improve accessibility and have text that can be extracted much more easily.

> everyone sees the same thing

Mostly. I've seen issues where PDF looked fine on a Mac but not on Windows.

Also, the fact that you see the same thing everywhere is good if you have one context of looking at things - e.g. if everyone uses big screen or if everyone prints the document, that's fine. But reading PDFs on e-book readers or smartphones can be a nightmare.

If the PDF uses a font that the creator neglected to embed, the reader’s system will have to supply the font, which could be a substitute. This is the only case I’ve seen where the PDF did not render exactly the same on all systems.

I like PDF a lot. It's got its drawbacks but the universal format is really helpful for layout-driven stuff. Shrug. It gets it done.

And yet everyone is different, on different devices.

Why should they all see the same thing?

Using PDF here is self serving. It’s actively user-hostile.

Always use PDF/A-1a (Tagged PDF) which contains the text in accessible format. For many governments this is a legal requirement.

With tagged PDF it's easy to get the text out.

LaTeX does not support PDF/A btw.

Not by default, but the pdfx package enables this, Peter Selinger has a nice guide: https://www.mathstat.dal.ca/~selinger/pdfa/

The linked instructions cover PDF/A-1b (have plain text version of contents), but Tagged PDF is more than that -- it's about encoding the structure of the document.

There is a POC package for producing Tagged PDF here https://github.com/AndyClifton/accessibility but it's not a complete solution yet.

Here is an excellent review article that talks about all the other options for producing Tagged PDFs from LaTeX (spoiler — there is no solution currently): https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-a... via https://news.ycombinator.com/item?id=24444427

It's not automatic, but also not that difficult to support


"The sad state of PDF-Accessibility of LaTex Documents"


It's a better format but still less accessible for machine translators. Most translation systems don't support PDF, or offer only limited support.

> For many governments this is a legal requirement.

As in Healthcare.

I'm confused as to what this comment means. Would you please clarify it?

Is he implying that providing healthcare is a legal requirement for most governments? If so, that seems off-topic.

They are saying that, like many governments, healthcare requires the use of the PDF/A-1a standard.

That is it.

You mean, like many (non-US) governments, (the US) healthcare requires the use of the PDF/A-1a? It's just a somewhat bizarre contraposition: "Like many animals, Helianthus annuus needs water to survive". Wait, but it's a plant, right? "Yes, and it needs water, just like animals".

I just assumed that they meant that there are many healthcare orgs that require this standard.

As (it also is) in healthcare.

I'm surprised no one ever uses XDP.

A flattened XFA/XDP PDF (no interactive elements) is more or less a PDF/A.

Do popular tools support creating XDP PDF?

Don't get me started on PDF's obsession with nice ligatures. For the love of God, when you have a special ligature like "ti", please convert the text in the copy buffer to a "ti" instead of an unpasteable nothing.

Nothing is more annoying than having to manually search a document that has been exported from PDF and make sure you catch all the now incorrect spellings where the ligatures have just disappeared: "action" -> "ac on", "finish" -> " nish".

I'm a bit of a typography buff, so just chiming in to say that there is nothing inherently wrong with ligatures in PDF!

As far as I understand, PDFs can be generated such that ligatures can be correctly cut'n'pasted from most PDF readers. I have seen PDFs where ligatures in links (ending in ”.fi”) cause problems, and I believe that's just an incorrectly generated PDF; ligatures done wrong.

Considering that PDF is a programming language designed to draw stuff on paper, going backwards from the program back to clean data is not something that one should expect to always work.

Seven, billions, ups.

I play (tabletop) RPGs online, we use a simple, free rule system (Tiny Six) and any time that I have to copypaste a specific paragraph of the rules in chat I always discover that there are missing characters (so I have to reread the block and fix it, ... in the end it would be faster to just read it aloud in voice).

Same might happen with scene or room descriptions taken from modules etc.

In case this helps, here is a mapping from Unicode ligature-->ascii for all the ligatures I know of (the ones supported by LaTeX fonts): https://github.com/ivanistheone/arXivLDA/blob/master/preproc...

This is assuming you are cleaning up the output of `pdftotext`, which in my experience is the best command line tool for extracting plain text.
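If a hand-maintained mapping is more than you need, Python's standard library can cover the common cases. The following is a minimal sketch, not the approach used in the linked repository: it relies on Unicode NFKC normalization, which expands the ligature code points (ff, fi, fl, ffi, ffl and friends) into plain letters. The expand_ligatures name and the sample string are made up for illustration. It only helps when the extractor actually emitted a ligature character; if the glyph was dropped entirely ("ac on"), there is nothing left to normalize.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as the 'fi' ligature with plain letters."""
    # NFKC normalization maps the Latin ligature code points (U+FB00..U+FB06)
    # to their component letters. It also rewrites other compatibility
    # characters (for example superscript digits), which may or may not be
    # what you want for your text.
    return unicodedata.normalize("NFKC", text)


# Sample input containing the fi (U+FB01), ffi (U+FB03) and fl (U+FB02) ligatures.
cleaned = expand_ligatures("\ufb01nish the o\ufb03ce \ufb02oor")
```

Because NFKC touches more than ligatures, an explicit mapping is still the safer choice when the rest of the text must stay byte-for-byte intact.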

In my case, I think the correct expression would be "what's so hard about meaningful PDF text extraction"

My company uses the services of, and has some sort of partnership with, a company that makes its business out of parsing CVs.

Recently we've seen a surge in CVs that after parsing return no name and / or no email, or the wrong data is being fetched (usually, from referees).

So, out of curiosity I took one (So far) pdf and extracted the text with python.

Besides the usual stuff that is already known (as in, the text is extracted as found, e.g., if you have 2 columns, the lines of text will appear after the line at the same level in the document that is in the other column), what I found - obviously take this with a grain of salt, as this is all anecdotal so far - is that some parts of the document have spaces between the characters, e.g.:


P r e s i d e n t o f t h e U n i t e d S t a t e s o f A m e r i c a

These CVs tend to be highly graphical. Also anecdotally, the metadata in the CV I parsed stated it was from Canvas [1]
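One cheap way to at least flag such lines before they reach a parser is to check how many whitespace-separated tokens are single characters. This is a rough sketch under stated assumptions: the function names, the 0.8 threshold and the four-token minimum are arbitrary choices for illustration, not anything taken from the CV parser described above.

```python
def looks_letter_spaced(line: str, threshold: float = 0.8) -> bool:
    """Heuristic: True when most tokens on the line are single characters."""
    tokens = line.split()
    if len(tokens) < 4:
        # Too short to judge; lines like "a b" are common in normal text.
        return False
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens) >= threshold


def collapse_letter_spacing(line: str) -> str:
    """Remove the spaces from a letter-spaced line; leave other lines alone."""
    if not looks_letter_spaced(line):
        return line
    return "".join(line.split())


spaced = "P r e s i d e n t o f t h e U n i t e d S t a t e s"
flagged = looks_letter_spaced(spaced)
collapsed = collapse_letter_spacing(spaced)
untouched = collapse_letter_spacing("Plain text is left alone here")
```

Note the limitation: when the extractor emits the same single space between letters and between words, as in the example above, collapsing loses the word boundaries. Recovering them would need glyph positions or a dictionary-based word splitter.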


How meaningful the text is is going to depend on how the PDF was generated.

Consider that creating a PDF is generally just the layout software rendering into a PDF context — no different as far as it is concerned than rendering to the screen.

Spaces are not necessary for display (although they might help for text selection, so they are often present). It is not important that headers are drawn first, or footers last, so often these scraps of text will appear in unexpected places....

PDF has support for screen readers, but of course very few PDFs in the wild were created with this extra feature.

You're completely correct but unfortunately, this doesn't matter in practice. It's true that thanks to formats like PDF/UA, PDFs can have decent support for accessibility features. Problem is, no one uses them. Even the barest minimum for accessibility provided by older formats like PDF/A, PDF/A-1a are rarely used. Heck, just something basic like correct meta-data is already asking for too much.

This means getting text out of PDFs requires rather sophisticated computational geometry and machine learning algorithms and is an active area of research. And yet, even after all that, it will always be the case that a fair few words end up mangled because trying to infer words, word order, sentences and paragraphs from glyph locations and size is currently not feasible in general.
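To make the simplest piece of that concrete, here is a toy sketch of inferring word boundaries from horizontal gaps between glyphs on a single line. Everything in it is an assumption for illustration: the (character, left, right) tuple shape, the glyphs_to_text name and the 0.3 gap ratio are invented, and real extractors also have to cope with rotation, kerning, multiple lines and columns, which this ignores.

```python
from typing import List, Tuple

# (character, left edge, right edge) in page units; a hypothetical shape.
Glyph = Tuple[str, float, float]


def glyphs_to_text(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text, inserting a space wherever the gap between
    neighbouring glyphs exceeds gap_ratio times the average glyph width."""
    if not glyphs:
        return ""
    # Drawing order is not reading order, so sort by the left edge first.
    ordered = sorted(glyphs, key=lambda g: g[1])
    widths = [right - left for _, left, right in ordered]
    avg_width = sum(widths) / len(widths)
    out = [ordered[0][0]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - prev[2]  # distance from previous right edge to this left edge
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur[0])
    return "".join(out)


glyphs = [
    ("t", 20.0, 25.0), ("o", 26.0, 31.0),  # second word, listed first on purpose
    ("g", 0.0, 5.0), ("o", 5.5, 10.5),
]
text = glyphs_to_text(glyphs)
```

A single fixed ratio breaks down quickly on justified text or letter-spaced headings, which is part of why the general problem ends up needing the heavier machinery mentioned above.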

Even if better authoring tools were to be released, it would still take a long time for these tools to percolate and then for the bulk of encountered material to have good accessibility.

This recent hn post is relevant: https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-a...

Yes, I know that. And I take it a large percentage of the people who use HN do too. But the question is, do the people who actually use the files know that? The example I'm giving is of people using PDFs with a certain format because they think the graphic appeal will make them stand out from the crowd in a very important matter (they are trying to land a new job, after all), and of people trying to find the right fit for their empty position. Neither of them knows about this, but it seriously affects the outcomes.

It wouldn't be the internet if donald trump wasn't dragged into it somehow

Sorry, couldn't help myself. It was totally benign, though. Maybe I should have used Boris instead as an example.

It is not uncommon for some (or all) of the PDF content to actually be a scan. In these cases, there is no text data to extract directly, so we have to resort to OCR techniques.

I've also seen a similar situation, but in some ways quite the opposite --- where all the text was simply vector graphics. In the limited time I had, OCR worked quite well, but I wonder if it would've been faster to recognise the vector shapes directly rather than going through a rasterisation and then traditional bitmap OCR.

Here's another 'opposite' - I had to process PDFs to find images in them.. and the PDFs were alternating scans of text + actual images.

I'm parsing PDFs and extracting tabular data - I am using this library https://github.com/coolwanglu/pdf2htmlEX to convert the PDF into HTML and then parsing thereafter. It works reasonably well for my use-case but there are all kinds of hacks that I've had to put in place. The system is about 5-6 years old and has been running since.

The use-case is basically one where there is a tabular PDF uploaded every week and the script parses it to extract the data. Thereafter a human interacts with the data. In such a scenario, every ~100 parses it fails and I'll have to patch the parser.

Sometimes text gets split up, but as long as the parent DOM node is consistent I can pull the text of the entire node and it seems to work fine.
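A rough sketch of the "pull the text of the entire node" trick, for anyone curious. The markup below is a simplified, well-formed stand-in and the `row`/`cell` class names are hypothetical, not what the converter actually emits; real converter output would need a proper HTML parser rather than `xml.etree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for converter output: each logical cell
# has had its text split across several child spans.
snippet = """
<div class="row">
  <div class="cell"><span>Total re</span><span>ven</span><span>ue</span></div>
  <div class="cell"><span>1,2</span><span>34</span></div>
</div>
"""


def cell_texts(fragment: str) -> list:
    """Return the full text of each cell node, ignoring how the text
    happens to be split among its children."""
    root = ET.fromstring(fragment)
    return [
        "".join(cell.itertext()).strip()
        for cell in root.iter("div")
        if cell.get("class") == "cell"
    ]


print(cell_texts(snippet))
```

The point is that the parser keys on the parent node, so the way the text is fragmented inside it can change from week to week without breaking anything.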

I would encourage you to look into the camelot python module - https://github.com/camelot-dev/camelot. I've worked with this for almost a year and it does work for most conventional tables and pdfs.

Did you publish your fixes so this could help others too? It seems like this repo is unmaintained atm.

They are very specific changes for my use-cases unfortunately

Read all the comments here about people struggling with PDF. All the energy and code wasted! I have watched this madness for my entire career.

PDF is being used wrongly. For information exchange, data should be AUTHORED in a parseable format with a schema.

Then PDF should be generated from this as a target format.

There is a solution: xml, xsl-fo.

We looked at the xml->xsl-fo->pdf route, and decided instead to use dblatex->LaTeX->pdf (starting from a slightly modified DocBook XML, which therefore required additional rules in dblatex). We were very satisfied with the result, and were able to do a lot with LaTeX style sheets, including a substantial change in output format when we moved to a different publisher.

I should add that our documents were not run-of-the-mill: they were grammars, and sometimes included non-Roman fonts (Bangla) and right-to-left text (Arabic and Thaana scripts). A lot of things came together at just the right time, like XeTeX (think Unicode-aware LaTeX) and good Nasta'liq fonts. Most people don't have those problems :-).

At my work, I had to generate PDFs based on the outcome of a workflow. My first thought was to use xml, xsl-fo. I spent 100s of hours trying to get everything to work properly with Apache FOP [1] (mostly layout issues with the XML stylesheet; it seems really limited). In the end we went with PrinceXML [2]. Much easier solution.

[1] https://xmlgraphics.apache.org/fop/
[2] https://www.princexml.com/

> There is a solution: xml, xsl-fo.

You're right about that, but sadly years of "xml-abuse" in the early naughts has given xml a bad reputation. So much so that other, inferior, markups were created like json and yaml. We ain't ever going back.

Meanwhile, pdf just worked, until the first time you crack it open and see what's inside the pdf file. I'll never forget the horror after I committed to a time-critical project where I claimed... "Oh, I'll just extract data from the PDF, how bad could it possibly be!"

> xml-abuse

Bad programmers, as usual, frustrated by their own badness.

Today's coders want us to use Jackson Pollock Object Notation everywhere for everything.

> We ain't ever going back.

Not so, friend. ODF and DOCX are XML. And these formats won't become JPON anytime soon.

> Turns out, much how working with human names is difficult due to numerous edge cases and incorrect assumptions

I think a good interview question would be, given a list of full names, return each person’s first name. It’s a great little problem where you can show meaningful progress in less than a minute, but you could also work on it full time for decades.

If I got that question in an interview, I would write a program that asks the user what their first name is. That's the only correct solution to that problem.

The fact that it’s unsolvable is what makes it a good interview problem. Seeing someone solve a problem with a correct solution doesn’t really tell you anything about the person or their thought process.

This sounds like the analogue of "I would offer the barometer to the building manager if he tells me how tall the building is."

Okay, but - "programmatically determine the user's name" is one of the classic foibles of inexperienced programmers. It's not that it's a hard problem, it's an impossible problem that people shouldn't be attempting, yet somehow still do, not unlike validating an email address with regex.

> it's an impossible problem that people shouldn't be attempting, yet somehow still do

You see exactly this kind of probabilistic algorithm being used every day in huge production apps. E.g. how do you think Gmail shows something like "me .. Vince, Tim" in the left column of your email inbox?

Correctly 100% of the time, as long as all of your contacts have names that match "Firstname Lastname". Google Contacts has separate fields, which enables "Vince" but at the expense of assuming that his full name is "Vince Sato" when he may write it "Sato Vince".

The problem with probabilistic algorithms is that you trade predictability for getting it right more frequently (but not 100% of the time). E.g. I could match common Japanese family names or look for a .jp TLD, but neither of these guarantees that the family name comes first, and even less does their absence imply that they lead with the first name.

I imagine Google's algorithm is no more sophisticated than:

1. Are they in your contacts? Use their first name from your contacts.
2. Are they a Google user with a profile you can view? Use the first name they provided.
3. Use your locale to make a wild guess.
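The three steps above could look something like this. To be clear, this is only a guess at the shape of such a fallback, not Google's actual code; `display_first_name`, the two lookup dicts and the `family_name_first` locale hint are all invented for the sketch.

```python
from typing import Dict


def display_first_name(
    email: str,
    full_name: str,
    contacts: Dict[str, str],
    profiles: Dict[str, str],
    family_name_first: bool = False,
) -> str:
    """Pick a short display name using a three-step fallback."""
    # 1. A first name the user stored in their own contacts wins.
    if email in contacts:
        return contacts[email]
    # 2. Otherwise use the first name the person entered in their profile.
    if email in profiles:
        return profiles[email]
    # 3. Last resort: a wild guess from the full name and a locale hint.
    parts = full_name.split()
    if not parts:
        return email
    return parts[-1] if family_name_first else parts[0]


contacts = {"vince@example.com": "Vince"}
print(display_first_name("vince@example.com", "Sato Vince", contacts, {}))
print(display_first_name("tim@example.com", "Tim Smith", contacts, {}))
```

Only step 3 is a guess; the first two steps work because somebody was asked for the first name as a separate field.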

Right exactly, and that's basically the correct approach. But imho that's a super interesting problem.

True story: a few weeks ago, I was trying to debug my Python program, which was reading PubMed XML. It was choking on 'None' as a first name. If you know Python, you know that it uses the keyword None to represent a null element. Turns out the XML used None where an author didn't have a first name, and the Python library I was using interpreted the string "None" as None rather than the string "None", and this was causing some downstream error where another function was expecting a string. I had to write a special Python function to convert the string "None" into a string (an empty string).
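One reading of the workaround described above is a small guard that turns a missing first name into an empty string before it reaches code that expects a `str`. The function name is made up, and this is only a sketch of that reading, not the actual fix.

```python
from typing import Optional


def first_name_or_empty(value: Optional[str]) -> str:
    """Map a missing first name (None) to an empty string so that
    downstream code always receives a str."""
    if value is None:
        return ""
    return value


print(repr(first_name_or_empty(None)))
print(repr(first_name_or_empty("Ada")))
```

Note that once the library has already turned the text "None" into `None`, a guard like this cannot tell a missing name apart from a person whose name really is "None".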

It seems like it should be doable to train a two-tower model, or something similar, that simultaneously runs OCR on the image and tries to read through the raw PDF, that should be able to use the PDF to improve the results of the OCR.

Does anyone know of any attempt at this?

Blah blah blah transformer something BERT handwave handwave. I should ask the research folks. :-)

I've thought about it, but haven't tried.

As an experiment, we once tried converting an OCRed dictionary (this one: https://www.sil.org/resources/archives/10969) into an XML dictionary database. (There are probably better ways to get an XML version of that particular dictionary, but as I say, this was an experiment.)

Despite the fact that it's a clean PDF, and uses a Latin script whose characters are quite similar to Spanish (and the glosses are in Spanish), the OCR was a major cause of problems: Treating the upside down exclamation as an 'i', failing to separate kerned characters, confusion between '1' and 'l', misinterpreting accented characters, and so on and so on. And for some reason the OCR was completely unable to distinguish bold from normal text, even though a human could do so standing several feet away.

So I did think of extracting the characters from the PDF. If it had been a real use case, instead of an experiment, I might have done so.

Write-up here: https://www.aclweb.org/anthology/W17-0112/

Interesting! We're working on OCR on menu photos, which has some parallels in structure, but has a much smaller common vocabulary than a dictionary, almost by necessity. :-)

Many menus are also available in PDF form, so we're trying to figure out if it's worth bothering with the PDF itself, or if we should just render to image and thus reduce the problem to the menu-photo one.

Yes. But. The main problem is that in many cases the visual structure is some highly custom form that is hard to present to the user in text.

And on top of this, many times there is no text data in the PDF, just a JBIG2 image per page.

So interesting to see this just 2 days after this on HN:

"The sad state of PDF-Accessibility of LaTeX Documents"


PDFs have accessibility features to make semantic text extraction easy... but it depends on the PDF creator to make it happen. It's crazy how best-case PDFs have identifiable text sections like headings... but worst-case is gibberish remapped characters or just bitmap scans...

I have a very similar project where I’ve extracted text and tables from over 1mm PDF filings in large bankruptcy cases - bankrupt11.com

I still haven’t found a good way of paragraph detection. Court filings are double spaced, and the white space between paragraphs is the same as the white space between lines of text. I also can’t use tab characters because of chapter headings and lists which don’t start with a <tab>. I was hoping to get some help from anyone who has done it before.

I imagine looking for lines that end prematurely would get you pretty far. Not all the way, since some last lines of paragraphs go all the way to the right margin, but combined with other heuristics it would probably work pretty well, especially if the page is justified.
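Here is a minimal sketch of the "lines that end prematurely" heuristic. It assumes each extracted line comes with the horizontal position where its text ends, measured from the left margin; the `Line` tuple and the `slack` cutoff are made up for illustration.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    text: str
    right: float  # distance from the left margin to the end of the text


def split_paragraphs(lines: List[Line], slack: float = 0.9) -> List[str]:
    """Close a paragraph after any line that ends well short of the right
    margin, where the margin is estimated as the widest line seen."""
    if not lines:
        return []
    margin = max(line.right for line in lines)
    paragraphs: List[str] = []
    current: List[str] = []
    for line in lines:
        current.append(line.text)
        if line.right < slack * margin:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


page = [
    Line("The debtor filed", 500.0),
    Line("a motion.", 210.0),
    Line("The court then", 498.0),
    Line("ruled.", 120.0),
]
print(split_paragraphs(page))
```

As noted, a last line that happens to run all the way to the margin is missed, so this would need to be combined with other signals.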

Not a bad idea!

This won't help you any, but we were doing OCR and then parsing a PDF dictionary. The vertical space between entries was usually larger than the vertical space between lines within an entry, by enough that we could distinguish the two. Except when a dictionary entry might (or might not) have been split between columns or pages... Especially problematic were a couple of entries that were more than one column or even one page in length.
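The vertical-gap idea can be sketched in a few lines. This assumes you already have the top coordinate of every line in one column, with y increasing down the page; the `factor` multiplier on the median gap is an arbitrary choice for the example, not what we actually used.

```python
from statistics import median
from typing import List


def split_on_vertical_gaps(tops: List[float], factor: float = 1.5) -> List[List[int]]:
    """Group line indices into entries. A new entry starts wherever the gap
    to the previous line exceeds factor * the median line gap."""
    if not tops:
        return []
    gaps = [b - a for a, b in zip(tops, tops[1:])]
    if not gaps:
        return [[0]]
    threshold = factor * median(gaps)
    groups = [[0]]
    for index, gap in enumerate(gaps, start=1):
        if gap > threshold:
            groups.append([index])
        else:
            groups[-1].append(index)
    return groups


# Hypothetical column: normal line spacing of 12, with 20 between entries.
tops = [0.0, 12.0, 24.0, 44.0, 56.0, 76.0]
print(split_on_vertical_gaps(tops))
```

It says nothing about entries that continue in the next column or on the next page, which is exactly where this approach struggled.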

If you want to extract tabular data from a text-based PDF, check out the Tabula project: https://tabula.technology/

The core is tabula-java, and there are bindings for R in tabulizer, Node.js in tabula-js, and Python in tabula-py.

I have used Tabula and recommend Camelot over it. The Camelot folks even put together a head to head comparison page (on their website) that shows their results consistently coming out ahead of Tabula.

My other complaint with Tabula is total lack of metadata. It’s impossible to know even what page of the PDF the tables are located on! You either have to extract one page at a time or you just get a data frame with no idea which table is located on which page.

The best I've used is PDFPlumber. Camelot lists it on its comparison page[1] but I've had better results.

Both are better than Tabula though.

[1] https://github.com/camelot-dev/camelot/wiki/Comparison-with-...

Thanks - I cannot get Camelot to run in parallel (I use celery workers to process PDFs), there is some bug in Ghostscript that SEGFAULTS. I’ll try using PDFPlumber instead! By the way Apache Tika has been the best for basic text extraction - even outputs to HTML which is neat.

I've come up with an idea of PDDF (Portable Data Document Format). The PDF format allows embedding files into documents. Why not embed an SQLite database file right in the PDF document with all the information nicely structured? Both formats are very well documented and there are lots of tools on any platform to deal with them. Humans see the visual part of the PDF, while machine processing works with the SQLite part.

Imagine that instead of parsing a PDF invoice, you just extract the SQLite file embedded in it, and it has a nice schema with invoice headers, details, customer details, etc.
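As a rough sketch of the SQLite half of that idea: the schema, table names and sample rows below are invented for illustration, and actually attaching the file to a PDF (and pulling it back out) is left to whatever PDF library you use.

```python
import os
import sqlite3
import tempfile


def build_invoice_db(path: str) -> None:
    """Write a tiny, hypothetical invoice schema to an SQLite file."""
    conn = sqlite3.connect(path)
    try:
        conn.executescript(
            """
            CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer TEXT, issued TEXT);
            CREATE TABLE line_item (
                invoice_id INTEGER REFERENCES invoice(id),
                description TEXT,
                quantity INTEGER,
                unit_price_cents INTEGER
            );
            INSERT INTO invoice VALUES (1, 'Example Corp', '2020-09-14');
            INSERT INTO line_item VALUES (1, 'Widget', 3, 250);
            INSERT INTO line_item VALUES (1, 'Shipping', 1, 500);
            """
        )
        conn.commit()
    finally:
        conn.close()


def invoice_total_cents(path: str, invoice_id: int) -> int:
    """What a consumer might run after extracting the attachment."""
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT COALESCE(SUM(quantity * unit_price_cents), 0) "
            "FROM line_item WHERE invoice_id = ?",
            (invoice_id,),
        ).fetchone()
        return row[0]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "invoice.sqlite")
    build_invoice_db(db_path)
    # These bytes are what would be attached to the PDF as an embedded file.
    with open(db_path, "rb") as fh:
        payload = fh.read()
    total = invoice_total_cents(db_path, 1)

print(len(payload), total)
```

The consumer never touches the page layout at all; it just runs a query against a schema it knows.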

Anything else in a PDF would work nicely as well - vector graphics, long texts, forms - all can be embedded as an SQLite file in a PDF.

It seems the PDF standard is ahead of you: https://news.ycombinator.com/item?id=24467959 :-)

Libreoffice can export hybrid PDFs containing the whole libreoffice document inside the PDF.

Haven't you heard the Mozilla Foundation? You can't just embed an SQLite database!!!1 It's all about those developer aesthetics! https://hacks.mozilla.org/2010/06/beyond-html5-database-apis...

I think that PDFs and developers aesthetics live on different planets...

(As a side note, Mozilla's reasoning is that no one knows how to turn SQLite into a backward-compatible standard, not even the SQLite developers. So while they recognise that WebSQL is a fantastic feature, it is inadequate for the web platform in the same way that adding a Python interpreter to the browser would be.)

How do you validate that the machine readable sqlite db has the same content as the human readable?

You can't presumably, you have to take it on trust. It's a neat idea. Getting structure from text dumps of PDFs is no fun.
