
What's so hard about PDF text extraction? - fagnerbrack
https://filingdb.com/b/pdf-text-extraction
======
totetsu
PDFs are the bane of my existence as someone who relies on machine translation
every day. The worst is that so many event flyers and things, even important
local government information, will just be dumped online as a PDF without any
other effort to make the contents available. I don't know how blind people are
supposed to participate in civic life here..

~~~
Polylactic_acid
The problem is accessibility features are totally invisible to normal users.
Someone with good intentions creates a pdf, and it works for them. They don't
use screen reader tools or know how they work so they don't even realize there
is a problem.

~~~
exikyut
And the problem with _that_ is that the screen reader tools are $1200, because
of the huge associated R&D costs and incredibly small target market.

The sad thing is that the very complexity required to implement a screen
reader is solely because of the technical nightmare that information accessibility
currently is.

It's reasonable to think that "if only" everything could be made ubiquitous
and everyone (= developers) could become collectively aware of these
accessibility considerations, maybe there would be a shift towards greater
semanticity.

Thing is, though, that NVDA is open source, and iOS has put a built-in free
screen-reader in the hands of every iPhone user... and not much has changed.

So it's basically one of those technologies forever stuck in the "initial
surge of moonshot R&D to break new ground and realign status quo" phase. :(

~~~
mikepurvis
How much good would moonshot-level R&D even be capable of doing? Without
realigning the world around a new portable format for printable documents,
isn't this 99% Adobe's problem to solve? Or are the hooks there to make much
more accessible PDFs, and the issue is that various popular generators of
those documents (especially WYSIWYG ones like MS Word) either don't populate
them, or perhaps don't even have the needed semantic data in the first place?

For my part, I would love to see PDFs which can seamlessly be viewed in
continuous, unpaged mode (for example, for better consumption on a small-
screen device like a phone or e-reader). Even just the minimal effort required
to tag page-specific header/footer type data could make a big difference here,
and I expect that type of semantic information would be useful for a screen
reader also.

~~~
roywiggins
Could governments insist that PDF software they buy be screen-reader friendly?
If this were rigorously done, you'd have all government documents be readable
by default, and then anyone else who ran the same software commercially would,
too.

You could also impose requirements on public companies to provide corporate
documents in accessible formats- these sorts of documents are already
regulated.

There are various levers that could be pulled; maybe those aren't the right
ones. But you could do it.

~~~
thayne
The problem is the PDF spec itself is not screen reader friendly.

~~~
myself248
PDF is mostly just a wrapper around Postscript, isn't it?

You could just put the original text in comments or something, wrapped in more
tags to say what it is.

~~~
jjgreen
PDF is a document format, Postscript is a Turing complete programming language
(and rather a fun one IMHO).

------
giovannibonetti
Here where I work we are parsing PDFs with
[https://github.com/flexpaper/pdf2json](https://github.com/flexpaper/pdf2json).
It works very well, and returns an array of {x, y, font, text}.
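
A minimal sketch (in Python) of consuming that output, assuming the items
really are flat {x, y, font, text} records as described above; field names may
differ between pdf2json versions:

    import json

    # Group pdf2json items into lines by y coordinate, then order each line by x.
    # The {x, y, font, text} field names are assumed from the description above.
    def items_to_lines(items, y_tolerance=2.0):
        lines = []
        for item in sorted(items, key=lambda it: (it["y"], it["x"])):
            if lines and abs(item["y"] - lines[-1][0]["y"]) <= y_tolerance:
                lines[-1].append(item)
            else:
                lines.append([item])
        return [" ".join(it["text"] for it in sorted(line, key=lambda it: it["x"]))
                for line in lines]

    with open("page.json") as f:  # hypothetical single-page dump from pdf2json
        print("\n".join(items_to_lines(json.load(f))))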

If you are familiar with Docker, here is how you can add it to your
Dockerfile.

    ARG PDF2JSON_VERSION=0.71
    RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
     && cd $HOME/pdf2json-$PDF2JSON_VERSION \
     && wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF2JSON_VERSION/pdf2json-$PDF2JSON_VERSION.tar.gz \
     && tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
     && ./configure > /dev/null 2>&1 \
     && make > /dev/null 2>&1 \
     && make install > /dev/null \
     && rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
     && cd

~~~
nurettin
Your command is cut off

~~~
saurik
Maybe it was scraped off of a PDF? ;P

In all seriousness, though: it doesn't look like it was cut off; I think the
final cd is just to return the docker builder thingee to the home directory?
The file was already built and installed by that point.

------
axaxs
A lot of people here seem to knock PDF, but I love it. Anyone who has tried to
use OpenOffice full time probably does too. We have 'descriptive text' in the
various MS formats, or even HTML/CSS. The problem is that every implementer
does things in slightly different ways. So my beautiful OpenOffice resume
renders with odd spacing and pages with one line of text in MS Office. With
PDF, everyone sees the same thing.

~~~
aikah
> A lot of people here seem to knock PDF, but I love it.

People with disabilities who rely on screen readers don't love it. There is
no such problem with HTML/CSS, which should be the norm for internet
documents.

> With PDF, everyone sees the same thing.

Yes, provided you can see in the first place...

~~~
jacquesm
> There is no such a problem with HTML/CSS which should be the norm for
> internet documents.

It should be. But meanwhile everybody seems to think it is perfectly OK that
there is a bunch of JavaScript that needs to run before the document will
display any text at all, and how that text makes it into the document is
anybody's guess.

~~~
ehnto
It was an amazing shift in priorities that I feel like I somehow missed the
discussion for. We went from being worried about hiding content with CSS to
sending nothing but script tags in the document body within 5 years or so. The
only concern we had when making the change seemed to be "but can Google read
it?". When the answer to that became "Uh maybe" we jumped the shark.

My bashful take is that nobody told the rest of the web development world that
they aren't Facebook, and they don't need Facebook like technology. So
everyone is serving React apps hosted on AWS microservices filled in by
GraphQL requests in order to render you a blog article.

I am being hyperbolic of course, but I was taken completely off guard by how
quickly we ditched years of best practices in favour of a few JS UI libraries.

------
oever
Always use PDF/A-1a (Tagged PDF), which contains the text in an accessible
format. For many governments this is a legal requirement.

With tagged PDF it's easy to get the text out.

~~~
brodo
LaTeX does not support PDF/A btw.

~~~
ejfiskbkkd
Not by default, but the pdfx package enables this, Peter Selinger has a nice
guide:
[https://www.mathstat.dal.ca/~selinger/pdfa/](https://www.mathstat.dal.ca/~selinger/pdfa/)

~~~
ivan_ah
The linked instructions cover PDF/A-1b (have plain text version of contents),
but Tagged PDF is more than that -- it's about encoding the structure of the
document.

There is a POC package for producing Tagged PDF here
[https://github.com/AndyClifton/accessibility](https://github.com/AndyClifton/accessibility)
but it's not a complete solution yet.

Here is an excellent review article that talks about all the other options for
producing Tagged PDFs from LaTeX (spoiler — there is no solution currently):
[https://umij.wordpress.com/2016/08/11/the-sad-state-of-
pdf-a...](https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-
accessibility-of-latex-documents/) via
[https://news.ycombinator.com/item?id=24444427](https://news.ycombinator.com/item?id=24444427)

------
mtippett
Don't get me started on PDF's obsession with nice ligatures. For the love of
god, when you have a special ligature like "ti", please convert the text in
the copy buffer to a "ti" instead of an unpasteable nothing.

Nothing is more annoying than having to manually search a document that has
been exported from PDF and make sure you catch all the now-incorrect spellings
when all the ligatures have just disappeared: "action" -> "ac on",
"finish" -> " nish".

~~~
metafunctor
I'm a bit of a typography buff, so just chiming in to say that there is
nothing inherently wrong with ligatures in PDF!

As far as I understand, PDFs can be generated such that ligatures can be
correctly cut'n'pasted from most PDF readers. I have seen PDFs where ligatures
in links (ending in ”.fi”) cause problems, and I believe that's just an
incorrectly generated PDF; ligatures done wrong.

Considering that PDF is a programming language designed to draw stuff on paper,
going backwards from the program back to clean data is not something that one
should expect to always work.

------
ericol
In my case, I think the correct expression would be "what's so hard about
_meaningful_ PDF text extraction".

My company uses the services of, and has some sort of partnership with, a
company that makes its business out of parsing CVs.

Recently we've seen a surge in CVs that, after parsing, return no name and/or
no email, or where the wrong data is fetched (usually from referees).

So, out of curiosity, I took one (so far) PDF and extracted the text with
Python.

Besides the usual stuff that is already known (as in, the text is extracted as
found, e.g. if you have 2 columns, each line of text is followed by the line
at the same height from the other column), what I found - obviously take this
with a grain of salt as this is all anecdotal so far - is that some parts of
the document have spaces between the characters, e.g.:

D O N A L D T R U M P

P r e s i d e n t o f t h e U n i t e d S t a t e s o f A m e r i c a

These CVs tend to be highly graphical. Also anecdotally, the metadata in the
CV I parsed stated it was from Canva [1]

[1] [https://www.canva.com/templates/EAD7WY_6Ncs-navy-blue-and-bl...](https://www.canva.com/templates/EAD7WY_6Ncs-navy-blue-and-black-professional-resume/)

~~~
JKCalhoun
How meaningful the text is is going to depend on how the PDF was generated.

Consider that creating a PDF is generally just the layout software rendering
into a PDF context — no different as far as it is concerned than rendering to
the screen.

Spaces are not necessary for display (although they might help for text
selection, so they are often present). It is not important that headers are
drawn first, or footers last — so these scraps of text will often appear in
unexpected places....

PDF has support for screen readers, but of course very few PDFs in the wild
were created with this extra feature.

~~~
Cybiote
You're completely correct, but unfortunately this doesn't matter in practice.
It's true that thanks to formats like PDF/UA, PDFs can have decent support for
accessibility features. The problem is, no one uses them. Even the barest
minimum for accessibility provided by older formats like PDF/A and PDF/A-1a is
rarely used. Heck, just something basic like correct metadata is already
asking for too much.

This means getting text out of PDFs requires rather sophisticated
computational geometry and machine learning algorithms and is an active area
of research. And yet, even after all that, it will always be the case that a
fair few words end up mangled because trying to infer words, word order,
sentences and paragraphs from glyph locations and size is currently not
feasible in general.

Even if better authoring tools were to be released, it would still take a long
time for these tools to percolate and then for the bulk of encountered
material to have good accessibility.

This recent HN post is relevant: [https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-a...](https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-accessibility-of-latex-documents/)

------
userbinator
_It is not uncommon for some (or all) of the PDF content to actually be a
scan. In these cases, there is no text data to extract directly, so we have to
resort to OCR techniques._

I've also seen a similar situation, but in some ways quite the opposite ---
where all the text was simply vector graphics. In the limited time I had, OCR
worked quite well, but I wonder if it would've been faster to recognise the
vector shapes directly rather than going through a rasterisation and then
traditional bitmap OCR.

~~~
rukuu001
Here's another 'opposite' - I had to process PDFs to find images in them, and
the PDFs were alternating scans of text + actual images.

------
fareesh
I'm parsing PDFs and extracting tabular data - I am using this library
[https://github.com/coolwanglu/pdf2htmlEX](https://github.com/coolwanglu/pdf2htmlEX)
to convert the PDF into HTML and then parsing thereafter. It works reasonably
well for my use-case, but there are all kinds of hacks that I've had to put in
place. The system is about 5-6 years old and has been running ever since.

The use-case is basically one where a tabular PDF is uploaded every week and
the script parses it to extract the data. Thereafter a human interacts with
the data. In such a scenario, it fails roughly every ~100 parses and I have to
patch the parser.

Sometimes text gets split up, but as long as the parent DOM node is consistent
I can pull the text of the entire node and it seems to work fine.
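
For what it's worth, the DOM-node trick looks roughly like this with
BeautifulSoup; the div.t selector is just a placeholder, since the class names
pdf2htmlEX emits depend on the document, so you would inspect the generated
HTML and adjust it:

    from bs4 import BeautifulSoup

    # Sketch of parsing pdf2htmlEX output: grab each text node wholesale so that
    # fragmented glyph runs inside it are stitched back together by get_text().
    with open("report.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    rows = []
    for node in soup.select("div.t"):  # placeholder selector for one line of text
        text = node.get_text(strip=True)
        if text:
            rows.append(text)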

~~~
funerr
Did you publish your fixes so this could help others too? it seems like this
repo is unmaintained atm.

~~~
fareesh
They are very specific changes for my use-cases unfortunately

------
heresie-dabord
Read all the comments here about people struggling with PDF. All the energy
and code wasted! I have watched this madness for my entire career.

PDF is being used wrongly. For information exchange, data should be AUTHORED
in a parseable format with a schema.

Then PDF should be generated from this as a target format.

There is a solution: xml, xsl-fo.

~~~
crispyambulance

> There is a solution: xml, xsl-fo.

You're right about that, but sadly years of "xml-abuse" in the early naughts
have given xml a bad reputation. So much so that other, inferior markups were
created, like json and yaml. We ain't ever going back.

Meanwhile, pdf just worked -- until the first time you crack it open and see
what's inside the pdf file. I'll never forget the horror after I committed to
a time-critical project where I claimed... "Oh, I'll just extract data from
the PDF, how bad could it possibly be!"

~~~
heresie-dabord
> xml-abuse

Bad programmers, as usual, frustrated by their own badness.

Today's coders want us to use Jackson Pollock Object Notation everywhere for
everything.

> We ain't ever going back.

Not so, friend. ODF and DOCX are XML. And these formats won't become JPON
anytime soon.

------
Alex3917
> Turns out, much how working with human names is difficult due to numerous
> edge cases and incorrect assumptions

I think a good interview question would be, given a list of full names, return
each person’s first name. It’s a great little problem where you can show
meaningful progress in less than a minute, but you could also work on it full
time for decades.

~~~
paledot
If I got that question in an interview, I would write a program that asks the
user what their first name is. That's the only correct solution to that
problem.

~~~
Alex3917
The fact that it’s unsolvable is what makes it a good interview problem.
Seeing someone solve a problem with a correct solution doesn’t really tell you
anything about the person or their thought process.

~~~
csours
[https://news.ycombinator.com/item?id=24447182](https://news.ycombinator.com/item?id=24447182)

------
oddthink
It seems like it should be doable to train a two-tower model, or something
similar, that simultaneously runs OCR on the image and tries to read through
the raw PDF, that should be able to use the PDF to improve the results of the
OCR.

Does anyone know of any attempt at this?

Blah blah blah transformer something BERT handwave handwave. I should ask the
research folks. :-)

~~~
mcswell
I've thought about it, but haven't tried.

As an experiment, we once tried converting an OCRed dictionary (this one:
[https://www.sil.org/resources/archives/10969](https://www.sil.org/resources/archives/10969))
into an XML dictionary database. (There are probably better ways to get an XML
version of that particular dictionary, but as I say, this was an experiment.)

Despite the fact that it's a clean PDF, and uses a Latin script whose
characters are quite similar to Spanish (and the glosses are in Spanish), the
OCR was a major cause of problems: Treating the upside down exclamation as an
'i', failing to separate kerned characters, confusion between '1' and 'l',
misinterpreting accented characters, and so on and so on. And for some reason
the OCR was completely unable to distinguish bold from normal text, even
though a human could do so standing several feet away.

So I did think of extracting the characters from the PDF. If it had been a
real use case, instead of an experiment, I might have done so.

Write-up here:
[https://www.aclweb.org/anthology/W17-0112/](https://www.aclweb.org/anthology/W17-0112/)

~~~
oddthink
Interesting! We're working on OCR on menu photos, which has some parallels in
structure, but has a much smaller common vocabulary than a dictionary, almost
by necessity. :-)

Many menus are also available in PDF form, so we're trying to figure out if
it's worth bothering with the PDF itself, or if we should just render to image
and thus reduce the problem to the menu-photo one.

------
crazygringo
So interesting to see this just 2 days after this on HN:

"The sad state of PDF-Accessibility of LaTex Documents"

[https://news.ycombinator.com/item?id=24444427](https://news.ycombinator.com/item?id=24444427)

PDFs _have_ accessibility features to make semantic text extraction easy...
but it depends on the PDF creator to make it happen. It's crazy how best-case
PDFs have identifiable text sections like headings... but worst-case is
gibberish remapped characters or just bitmap scans...

------
iav
I have a very similar project where I’ve extracted text and tables from over
1mm PDF filings in large bankruptcy cases - bankrupt11.com

I still haven’t found a good way of paragraph detection. Court filings are
double spaced, and the white space between paragraphs is the same as the white
space between lines of text. I also can’t use tab characters because of
chapter headings and lists which don’t start with a <tab>. I was hoping to get
some help from anyone who has done it before.

~~~
bobbylarrybobby
I imagine looking for lines that end prematurely would get you pretty far. Not
all the way, since some last lines of paragraphs go all the way to the right
margin, but combined with other heuristics it would probably work pretty well,
especially if the page is justified.
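
A rough sketch of that heuristic, using character counts as a stand-in for
real line widths (actual x-extents from the extractor would be more robust);
the 0.85 ratio is an arbitrary starting point to tune:

    # Treat a line noticeably shorter than the longest line as a paragraph end.
    def split_paragraphs(lines, ratio=0.85):
        widths = [len(line) for line in lines if line.strip()]
        if not widths:
            return []
        full_width = max(widths)
        paragraphs, current = [], []
        for line in lines:
            if not line.strip():
                continue
            current.append(line.strip())
            if len(line) < ratio * full_width:  # line ends early: likely paragraph end
                paragraphs.append(" ".join(current))
                current = []
        if current:
            paragraphs.append(" ".join(current))
        return paragraphs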

~~~
iav
Not a bad idea!

------
Cactus2018
If you want to extract tabular data from a text-based PDF, check out the
Tabula project: [https://tabula.technology/](https://tabula.technology/)

The core is tabula-java, and there are bindings for R in tabulizer, Node.js in
tabula-js, and Python in tabula-py.
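
Basic tabula-py usage looks roughly like this (it shells out to tabula-java,
so it needs a JVM); each detected table comes back as a pandas DataFrame:

    import tabula

    # Extract every table tabula finds and dump each one to its own CSV file.
    tables = tabula.read_pdf("report.pdf", pages="all")
    for i, df in enumerate(tables):
        df.to_csv(f"table_{i}.csv", index=False)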

~~~
iav
I have used Tabula and recommend Camelot over it. The Camelot folks even put
together a head to head comparison page (on their website) that shows their
results consistently coming out ahead of Tabula.

My other complaint with Tabula is total lack of metadata. It’s impossible to
know even what page of the PDF the tables are located on! You either have to
extract one page at a time or you just get a data frame with no idea which
table is located on which page.

~~~
nl
The best I've used is PDFPlumber. Camelot lists it on its comparison page[1]
but I've had better results.

Both are better than Tabula though.

[1] [https://github.com/camelot-dev/camelot/wiki/Comparison-
with-...](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-
PDF-Table-Extraction-libraries-and-tools#pdfplumber)
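
For reference, a minimal pdfplumber sketch; because you iterate page by page,
you always know which page a table came from, which addresses the metadata
complaint above:

    import pdfplumber

    # extract_tables() returns each table as a list of rows (lists of cell strings),
    # so the page number comes for free from the iteration.
    with pdfplumber.open("report.pdf") as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                print(page_number, table)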

~~~
iav
Thanks - I cannot get Camelot to run in parallel (I use celery workers to
process PDFs); there is some bug in Ghostscript that segfaults. I'll try using
PDFPlumber instead! By the way, Apache Tika has been the best for basic text
extraction - it even outputs to HTML, which is neat.
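
Basic extraction with the tika-python bindings looks something like this; the
xmlContent flag (for the HTML-ish output mentioned above) is from memory, so
double-check it against the current docs:

    from tika import parser

    # Tika returns a dict with "metadata" and "content"; with xmlContent=True the
    # content comes back as XHTML rather than plain text.
    parsed = parser.from_file("filing.pdf", xmlContent=True)
    print(parsed["content"])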

------
dgudkov
I've come up with an idea of PDDF (Portable Data Document Format). The PDF
format allows embedding files into documents. Why not embed an SQLite database
file right in the PDF document, with all the information nicely structured?
Both formats are very well documented and there are lots of tools on any
platform to deal with them. Humans see the visual part of the PDF, while
machine processing works with the SQLite part.

Imagine that instead of parsing a PDF invoice, you just extract the SQLite
file embedded in it, and it has a nice schema with invoice headers, details,
customer details, etc.

Anything else in PDF would work nicely as well - vector graphics, long texts,
forms - all of it can go into the SQLite file embedded in the PDF.
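
A rough sketch of that idea using pypdf's attachment support; invoice.sqlite
is a hypothetical database produced by whatever generated the invoice:

    from pypdf import PdfReader, PdfWriter

    # Copy the human-readable invoice as-is and attach a structured SQLite file
    # alongside it for machine consumers.
    reader = PdfReader("invoice.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    with open("invoice.sqlite", "rb") as db:
        writer.add_attachment("invoice.sqlite", db.read())

    with open("invoice_with_data.pdf", "wb") as out:
        writer.write(out)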

~~~
rasz
Haven't you heard from the Mozilla Foundation? You can't just embed an SQLite
database!!!1 It's all about those developer aesthetics!
[https://hacks.mozilla.org/2010/06/beyond-html5-database-
apis...](https://hacks.mozilla.org/2010/06/beyond-html5-database-apis-and-the-
road-to-indexeddb/)

~~~
afiori
I think that PDFs and developer aesthetics live on different planets...

(As a side note, Mozilla's reasoning is that no one knows how to turn SQLite
into a backward-compatible standard, not even the SQLite developers. So while
they recognise that WebSQL is a fantastic feature, it is as inadequate for the
web platform as adding a Python interpreter to the browser would be.)

------
tecoholic
I once took a job for $25 to replace some text on a set of PDFs
programmatically, assuming it would take an hour max, only to end up spending
8 hours before I found a decent enough solution. I have never touched a PDF
manipulation task since.

------
andrewfong
PDFs are terrible for screen readers, for pretty much all the reasons listed
here.

This article does make me wonder though if we'll ever get to a point where OCR
tech is sufficiently accurate and efficient that screen readers will start
incorporating OCR in some form.

~~~
totetsu
It will have to outpace the DRM tech that will stop you from capturing pixels
from PDFs deemed "protected".

~~~
rosstex
Can you link to something that describes this tech?

~~~
ComputerGuru
There are way too many approaches to enumerate in a reply, but the
StackOverflow answer covers just a few of the approaches I’ve seen used on
Windows:
[https://stackoverflow.com/a/22218857/17027](https://stackoverflow.com/a/22218857/17027)

The GPU approach is considerably harder to work around, fwiw. Still possible,
of course.

~~~
manquer
You will always be able to decode it; today, running a VM or just Chrome with
a head on a box is a simple way to bypass these techniques most of the time.

Even if DRM on video output becomes common (HDMI has it, unlike VGA), video
protocols ultimately have to emit an analog signal, rendering to photons your
eye can see; CAM rippers use this. It will always be possible to decode. [1][2]

[1] Until neuralink-style tech becomes mainstream and every content owner
requires that interface to consume the content, and the interface can
biologically authenticate the user.

[2] They can trace a CAM ripper via unique IDs, embedded watermarks, etc., but
they can never "stop" them from actually ripping; they can only block/penalize
the legal source the rip originated from.

------
Gedxx
If you are interested in extracting PDF tables, I recommend Tabula; here's an
example: [https://www.ikkaro.net/convert-pdf-to-excel-csv/](https://www.ikkaro.net/convert-pdf-to-excel-csv/)

~~~
codegladiator
Also, tetpdf (paid) works really well (I used it for extracting transactions
from account statement PDFs; the demo works for 2-3 pages). I actually used a
combination of tabula and tetpdf.

------
pvg
Recently:
[https://news.ycombinator.com/item?id=22473263](https://news.ycombinator.com/item?id=22473263)

------
visarga
Using ML it is possible to parse PDF pages and interpret their layout and
contents. Coupled with summarisation and text-to-speech, it could make PDFs
accessible to blind people. The parsing software would run on the user's
system and be able to cover all sorts of inputs, even images and web pages, as
it is just OCR, CV and NLP.

The advantage is that PDFs become accessible immediately, as opposed to
waiting for the day when all PDF creators agree on a better presentation.

Example: A layout dataset - [https://github.com/ibm-aur-
nlp/PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet)

and: SoundGlance: Briefing the Glanceable Cues of Web Pages for Screen Reader
Users
[http://library.usc.edu.ph/ACM/CHI2019/2exabs/LBW1821.pdf](http://library.usc.edu.ph/ACM/CHI2019/2exabs/LBW1821.pdf)

A related problem is invoice/receipt/menu/form information extraction. Also,
image based ad filtering.

------
rietta
In Linux, pdftotext works pretty well for the basic extraction needs I've had
over the years. I have many times used this extremely simple Ruby wrapper
[https://gist.github.com/rietta/90ae2187606953bee9735c00f3a6e...](https://gist.github.com/rietta/90ae2187606953bee9735c00f3a6e766).
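
The Python equivalent is just as small; a sketch that shells out to poppler's
pdftotext, where "-" as the output file sends the text to stdout:

    import subprocess

    # Thin wrapper around poppler's pdftotext; -layout preserves the column layout.
    def pdf_to_text(path):
        result = subprocess.run(
            ["pdftotext", "-layout", path, "-"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout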

------
tmvnty
Related: Why GOV.UK content should be published in HTML and not PDF
([https://gds.blog.gov.uk/2018/07/16/why-gov-uk-content-
should...](https://gds.blog.gov.uk/2018/07/16/why-gov-uk-content-should-be-
published-in-html-and-not-pdf/))

------
beervirus
Just last week I was trying to extract text from a PDF and it had extra spaces
between all the characters. Eventually in frustration I exported to PNG, made
a new PDF, and OCRed it.

It felt dirty.

~~~
totetsu
pdfsandwich helps make this a one command process

[http://www.tobias-elze.de/pdfsandwich/](http://www.tobias-
elze.de/pdfsandwich/)

~~~
hpfr
This looks quite useful, but it appears it hasn't been updated since 2010,
and I'd imagine there have been some advancements in this domain in the last
decade.
If anyone knows of similar tools for aligning and improving book scans, I’d
love to hear of them. Thanks for this in any case!

------
leeter
I remember a prior boss of mine was once asked if our reporting product could
use PDF as an input. He chuckled and said "No, there is no returning from
chaos"

------
thdrdt
Isn't the reason extraction is almost impossible that everything in a PDF is
placed inside a box? So every line has its own box and paragraphs don't
exist. Paragraphs are just multiple boxes placed together. The context of
sentences is completely gone.

~~~
joquarky
Basically. The text placement syntax is essentially "put this glyph or string
of glyphs at these coordinates". So it could even be as granular as a textbox
for each individual glyph.
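
You can see that granularity directly by dumping per-character placements,
e.g. with pdfplumber (a quick sketch; example.pdf is any PDF at hand):

    import pdfplumber

    # Each entry in page.chars is one positioned glyph: its text, x/y origin, font.
    with pdfplumber.open("example.pdf") as pdf:
        for ch in pdf.pages[0].chars[:20]:
            print(ch["text"], round(ch["x0"], 2), round(ch["top"], 2), ch["fontname"])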

------
nl
I've (partially) done this (for ASX filings, not EU).

God, what a horrible mess it is. They are right about OCR being the best
approach sometimes, but then there are tables. Tables in PDFs are... well,
there is active academic research into the best way to extract the data.

------
konfuzio
PDFs are a pain, and Konfuzio is AI software to make PDF content machine
readable. On our journey to structure PDF content, or even scans, we have been
supported by large enterprises in the banking, insurance and audit industries.
We are in closed beta for data scientists, so feel free to request a demo and
a free user account.

[http://www.konfuzio.com](http://www.konfuzio.com)

Disclaimer: I am the co-founder of Konfuzio, a start-up founded in 2016 based
in Germany.

------
thayne
Going the other way is terrible too. I had to work on making text in PDFs
that our software generated selectable and searchable. PDF simply doesn't have
any way to communicate the semantics to the viewer, so things like columns,
sidebars, drop letters, etc. would all confuse viewers. It didn't help that
every PDF viewer uses its own heuristics to figure out how text is supposed to
be ordered. Ironically, Acrobat was one of the worst at figuring out the
intended layout.

~~~
LoSboccacc
I've worked on an HTML-to-PDF engine; I don't remember that many issues with
the text itself beyond making the PDF metrics the same as the HTML so that
line breaks occurred in the same places.

Did you write the columns one after another (first all of the left, then all
of the right), or going left-to-right one line of text at a time, top down?

~~~
thayne
It was a few years ago, but I think it was all one column then another.

And it worked ok in some viewers, but not others.

------
tyingq
While it still has issues, I have pretty good luck using pdftotext, then in
the cases where the output isn't quite right, adding the -layout or -table
options.

~~~
dunham
Yeah, I've had good luck with pdftotext -layout in the past. And in one case
I've used pdftohtml -xml (plus some post-processing), which is useful if the
layout is complex or you want to capture stuff like bold/italic.

------
eaclarich
Despite all recent OCR, AI and NLP technology, PDFs won't make "computational"
sense unless you know in advance what kind of data you are looking for. The
truth of the matter is that PDF was designed to convey pretty "printed"
information in an era where "eyes only" was all that mattered. Today the PDF
format just can't provide the throughput and reliability that interop between
systems requires.

------
cel1ne
> Copying the text gives: “ch a i r m a n ' s s tat em en t” Reconstructing
> the original text is a difficult problem to solve generally.

Why not look for stretches of characters with spaces between them, then
concatenate, check against a dictionary, and if a match is found, remove the
spaces?

> “On_April_7,_2013,_the_competent_authorities”

Same here.
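
A sketch of that heuristic, greedily joining runs of tokens whose
concatenation is a known word; WORDS is a stand-in for a real word list:

    WORDS = {"chairman's", "statement"}  # stand-in for a real dictionary

    # Greedily merge consecutive tokens when the result is a dictionary word.
    def collapse_spaced_tokens(line, max_join=12):
        tokens = line.split()
        out, i = [], 0
        while i < len(tokens):
            for j in range(min(len(tokens), i + max_join), i, -1):
                candidate = "".join(tokens[i:j])
                if candidate.lower() in WORDS:
                    out.append(candidate)
                    i = j
                    break
            else:
                out.append(tokens[i])
                i += 1
        return " ".join(out)

    print(collapse_spaced_tokens("ch a i r m a n ' s s tat em en t"))
    # -> "chairman's statement" with those two dictionary entries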

~~~
hrktb
It's doable, but it only covers the easiest cases. If it's a one-off script,
or some automation you give as a base for a human to review afterwards, it can
be a good first step.

If it's supposed to be a somewhat final result, run against a dataset you have
little control over (people sending you PDFs they made, vs. PDFs coming from a
known automated generator), you'll hit all the other not-so-edge cases very
fast.

Like: otherwise invisible characters inserted in the middle of your text,
layout that makes no logical sense and puts the text in a weird order but was
fine when it was displayed on the page, characters missing because of weird
typographic optimization (ligatures, characters only in specific embedded
fonts, etc.). Basically everything in the article is pretty easy to find in
the wild.

------
marksoftwareguy
I built and successfully sold an entire startup around solving all of these
problems and more. I'm interested to learn about projects where I can
contribute my knowledge. Hit me up if you need some help with something
significant. Contact details are in my about.

------
dmoo
Probably preaching to the choir, but for me step one is to use poppler tools:
pdftotext -layout

------
microcolonel
PDF is so bad for representing text that when you tell people about it, they
think you're trolling or wrong.

[https://news.ycombinator.com/item?id=24112836](https://news.ycombinator.com/item?id=24112836)

~~~
tptacek
Representing text isn't PDF's job. PDF is meant to be "virtual paper".

~~~
rlayton2
Yup, the use case of PDFs is "export to PDF and check it immediately before
sending to the printer".

All other use cases of PDF are better served by other formats.

~~~
bscphil
> All other use cases of PDF are better served by other formats.

If that were the case, we would all be using PDF/A, or better still a subset
of that. The fact that PDF has much more stuff in the spec than that suggests
that a large number of people find PDF to be the best format for what they're
trying to do.

~~~
Shared404
> a large number of people find PDF to be the best format for what they're
> trying to do.

Or they just don't know any better. I could certainly see this being the case
for many office workers.

~~~
bscphil
I mean, I'm certainly an advocate for using PDF/A. I agree with you about what
PDF _should_ be used for, but it's hard to imagine the spec got that inflated
if nobody thought it needed all those features.

~~~
Shared404
As we all know, it is an Adobe technology first and foremost. That could
explain why somebody thought it needed those features, despite the fact that
it really shouldn't.

More clarification of what I mean, in the form of a dangerous conversation:

OfficeCorp Exec: Hey Adobe, you know what would be great? Fillable form
fields! And dynamic stuff!

Adobe Sales: Great! Wonderful feedback, we can do it!

Edit: Fix tone. Or try to. Maybe I'm just overthinking what I say?

------
amai
PDF should really be called UPDF (Un-Portable Document Format). Everybody who
is using PDF for data exchange makes it clear that he/she is either
incompetent or actively trying to make data exchange impossible. It is no
accident that many governments publish statistics as PDF files (instead of
CSV, for example), which makes it very hard to parse and search through the
data.

------
UglyToad
I've written a little overview of the open source options for text extraction
available in C# [https://dev.to/eliotjones/reading-a-pdf-in-c-on-net-
core-43e...](https://dev.to/eliotjones/reading-a-pdf-in-c-on-net-core-43ef)

At some point I need to port PdfTextStripper from PDFBox, it seems to be among
the most reliable libraries for extracting text in a generic way.

------
OliverJones
The same interpretation hassles crop up when trying to extract text from PCL
documents. This happened to me when implementing a SaaS feature allowing
customers to use a printer driver to insert documents into a document
repository.

The weirdest one? An application where the PCL code points used the EBCDIC
character set. For once, being an old-timer helped.

~~~
mcswell
EBCDIC; EBCDIC... Now that's a name I've not heard in a long time.

------
rasz
> PDF read protection

Click print, copy all you want in the print preview window. Brilliant
protection scheme.

------
citizenpaul
I really wish this company had some sort of product. However I spoke with them
and they are essentially just a dev shop that specializes in PDF manipulation.
Just to save other people some time, in case you think they simply have a
product you can use.

------
amadeuspagel
Since there are so many books available both as PDF and in other formats,
where it's easier to extract the text, shouldn't it be possible to train a
neural network on them?

------
jordache
What I hate the most are those job sites that extract the text from your PDF.
They have close to a 99% failure rate, requiring me to manually tweak the
content they extracted.

~~~
Liquix
* Please upload an up-to-date resume detailing your education and job history.

* Please fill out this series of forms on our website detailing your education and job history.

* Congrats, you have been selected for phone screening! Please be prepared to discuss your education and job history with the recruiter.

:'(

------
teraku
I wonder if there is a flag or feature one can set in {major-vendor text
editor} when exporting to PDF to make it as accessible and machine-readable as
possible?

------
kayhi
What software is the current leader in OCR solutions?

~~~
nl
Abbyy FineReader is generally better than any of the major cloud vendors in my
testing.

See also this previous discussion:
[https://news.ycombinator.com/item?id=20470439](https://news.ycombinator.com/item?id=20470439)

~~~
unityByFreedom
Also quite expensive since it is at the head of the pack IIRC. There is
probably some value in making a competitor with new deep learning techniques
provided you have a sufficiently diverse training set. It would take years to
build tho.

~~~
nl
I've done some work on deep learning approaches - specifically table
extraction.

It's not at all obvious how to make this work - there is a lot of human
judgement involved in deciding what a header is vs. what the values are,
especially with merged header columns/rows.

~~~
unityByFreedom
Yeah, I remember Abbyy also has an interface to define layouts for this kind
of problem, i.e., this thing is a table and here are the headers, etc.

Sorry, I was not trying to say deep learning would be a substitute for all
such issues, just that new approaches may help a smaller team build those
tools more efficiently.

I don't know if Abbyy combines its layout tool with training a model for
customers, but it seems like a reasonable thing to build and expose.

------
bambax
Isn't the solution to render a PDF one wants to extract text from, with a high
dpi, and then OCR the result?
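
For concreteness, a minimal sketch of that route with pdf2image and
pytesseract (both need their native dependencies, poppler and the tesseract
binary, installed):

    from pdf2image import convert_from_path
    import pytesseract

    # Rasterize every page at 300 dpi, then OCR each rendered image.
    pages = convert_from_path("document.pdf", dpi=300)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    print(text)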

~~~
mcswell
Have you ever looked at the output of an OCR engine? Character (T) at position
1.03409 38.953934 Character (h) at position 9.89402 38.927831 etc. And
hypothesized line, para etc. boundaries that (at least in my experience) might
have come from a Ouija board.

------
jrootabega
"What's so hard about PDF text extraction?" \- if Weird Al ever wrote an Elvis
Costello parody

------
martingoodson
At our firm, evolution.ai, we prefer to always read directly from pixels. It
gives you much more dependable results, for the reasons laid out here.

------
johnthescott
excellent text extraction using pdfbox.apache.org

------
person_of_color
TL;DR: text can be encoded as plaintext, an image, or binary.

~~~
vivekseth
There’s a little more nuance than that. Even if text is drawn using plaintext
data there’s no guarantee that the characters/words appear in the correct
order or have the proper white space between them.

~~~
person_of_color
The best method is probably to render the PDF and use OCR.

~~~
liability
Unfortunately that's obnoxiously inefficient if you're trying to run it
through text-to-speech in real time.

------
lowwave
Personally I think PDF text should NOT be extractable. It is meant to show a
printed document.

