
What's so hard about PDF text extraction? - maest
https://www.filingdb.com/pdf-text-extraction
======
tolmasky
This is why iPhone didn't initially ship with double-tap to zoom for PDF
paragraphs (like it had for blocks on web pages). I know because I was
assigned the feature, and I went over to the PDF guy to ask how I would
determine on an arbitrary PDF what was probably a "block" (paragraph), and I
got a huge explanation on how hard it would be. I relayed this to my manager
and the bug was punted.

Edit: To add a little more color, given that none of us was (or at least
certainly I wasn't) an expert on the PDF format, we had so far treated the bug
like a bug of probably at-most moderate complexity (just have to read up on
PDF and figure out what the base unit is or whatever). After discovering what
this article talks about, it became evident that any solution we cobbled
together in the time we had left would really just be signing up for an
endless stream of it-doesn't-work-quite-right bugs. So, a feature that would
become a bug emitter. I remember in particular considering one of the main use
cases: scientific articles that are usually in two columns, AND also use
justified text. A lot of times the spaces between words could be as large as
the spaces between columns, so the statistical "grouping" of characters to try
to identify the "macro rectangle" shape could get tricky without severely
special-casing for this. All this being said, as the story should make clear,
I put about one day of thought into this before the decision was made to avoid
it for 1.0, so for all I know there are actually really good solutions to
this. Even writing this now I am starting to think of fun ways to deal with
this, but at the time, it was one of a huge list of things that needed to get
done and had been underestimated in complexity.

~~~
Alex3917
> I know because I was assigned the feature, and I went over to the PDF guy to
> ask how I would determine on an arbitrary PDF what was probably a "block"
> (paragraph), and I got a huge explanation on how hard it would be.

The funny thing is that creating a universal algorithm to convert PDFs and/or
HTML to plaintext is probably comparable in difficulty to building level 5
self-driving cars, and would accrue at least as much profit to any company
that can solve it. But there are hundreds of billions of dollars going into
self-driving cars, and like zero dollars going into this problem.

~~~
tolmasky
What are the groups that would benefit most from the PDF-to-HTML conversion?
Who are the customers that would drive this profit? I tried to make those
sentences not sound contentious, but unfortunately they do; I am genuinely
curious about this space and who is feeling the lack of this technology most.

~~~
greycol
Almost any business that has physical suppliers or business customers.

PDF is the de facto standard for any invoicing, POs, quotes, etc.

If you solve the problem you can effectively deal programmatically with
invoicing/payments/large parts of ordering/dispensing. It's a no-brainer to
add it on to almost any financial/procurement software that deals with
inter-business stuff.

Any small-medium physical business can probably halve their financial
department if you can dependably solve this issue.

~~~
tastyminerals
Yes, most invoices are in PDF, but only about 40% of them are native PDFs,
meaning they are actual documents rather than scanned images converted to
PDF. There are also compound PDF invoices which contain images. So, in order
to extract data from them, one needs not only a good PDF parser but also an
OCR engine.

~~~
thaumasiotes
If you're using an OCR engine to understand PDFs that are nothing but a
scanned image embedded in a PDF... what do you need a PDF parser for? You can
always just render an image of a document and then use that.

~~~
tastyminerals
For accuracy and speed. The market SOTA Abbyy is far from being accurate.

~~~
speedplane
> The market SOTA Abbyy is far from being accurate.

While Abbyy is likely the best, it's also incredibly expensive. Roughly on the
order of $0.01/page or maybe at best a tenth of that in high volume.

For comparison, I run a bunch of OCR servers using the open source tesseract
library. The machine-time on one of the major cloud providers works out to
roughly $0.01 for 100-1000 pages.

~~~
bhanhfo
OCR.space charges only $10 for 100,000 conversions. The quality is good, but
not as good as Abbyy.

------
giovannibonetti
One of the main features of the product I work on is data extraction from a
specific type of PDF. If you want to build something similar these are my
recommendations for you:

\- Use
[https://github.com/flexpaper/pdf2json](https://github.com/flexpaper/pdf2json)
to convert the PDF into an array of (x, y, text) tuples

\- Use a good text parsing library. Regexes are probably not enough for your
use case. In case you are not aware of the limitations of regexes, you may want
to learn about the Chomsky hierarchy of formal languages.

Here is the section of our Dockerfile that builds pdf2json for those of you
that might need it:

    # Download and install pdf2json
    ARG PDF2JSON_VERSION=0.70
    RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
     && cd $HOME/pdf2json-$PDF2JSON_VERSION \
     && wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF2JSON_VERSION/pdf2json-$PDF2JSON_VERSION.tar.gz \
     && tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
     && ./configure > /dev/null 2>&1 \
     && make > /dev/null 2>&1 \
     && make install > /dev/null \
     && rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
     && cd
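
A minimal sketch of the downstream grouping step, working purely on the
(x, y, text) tuples described above (the y tolerance is a made-up starting
point you would tune per corpus):

    # Sketch: rebuild reading order from (x, y, text) fragments.
    # Assumes y grows downward and that fragments on the same visual line
    # have similar y values; tune y_tolerance for your documents.
    def fragments_to_lines(fragments, y_tolerance=3):
        """fragments: iterable of (x, y, text) tuples."""
        groups = []  # list of (y, [(x, text), ...])
        for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
            if groups and abs(groups[-1][0] - y) <= y_tolerance:
                groups[-1][1].append((x, text))
            else:
                groups.append((y, [(x, text)]))
        return [" ".join(t for _, t in sorted(frags)) for _, frags in groups]

    print(fragments_to_lines([(10, 100, "Hello"), (60, 101, "world"), (10, 120, "next line")]))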

~~~
robinhowlett
Thanks for the links - agree about the (x,y,text) callout but other metadata
like font size can be useful too.

Regexes have limitations but I was able to leverage them sufficiently for
PDFs from a single source.

I parsed over 1 million PDFs that had a fairly complex layout using Apache
PDFBox and wrote about it here:
[https://www.robinhowlett.com/blog/2019/11/29/parsing-
structu...](https://www.robinhowlett.com/blog/2019/11/29/parsing-structured-
data-complex-pdf-layouts/)

~~~
giovannibonetti
Oh, yeah, pdf2json returns font sizes as well. I forgot to mention that.

~~~
pierre
pdf2json font names can sometimes be incorrect, as it only extracts them
based on a pre-set collection of fonts. I suggest using this fork that fixes
it:

[https://github.com/AXATechLab/pdf2json](https://github.com/AXATechLab/pdf2json)

Bounding boxes can also be off with pdf2json. Pdf.js does a better job but has
a tendency to not handle some ligatures/glyphs well, sometimes transforming a
word like "finish" into "f nish" (eating the i in this case). pdfminer (Python)
is the best solution yet, but a thousand times slower....

------
daniel-levin
I’m a contractor. One of my gigs involved writing parsers for 20-something
different kinds of pdf bank statements. It’s a dark art. Once you’ve done it
20 times it becomes a lot easier. Now we simply POST a pdf to my service and
it gets parsed and the data it contains gets chucked into a database. You can
go extremely far with naive parsers. That is, regex combined with
positionally-aware fixed-length formatting rules. I’m available for hire re.
structured extraction from PDFs. I’ve also got a few OCR tricks up my sleeve
(eg for when OCR thinks 0 and 6 are the same)
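
To make "regex combined with positionally-aware rules" concrete, here is a
rough sketch; the column boundaries, regexes and sample line are invented for
illustration, not any real bank's layout:

    import re

    # Hypothetical statement layout: columns identified by x ranges that were
    # measured once for this particular bank's PDFs.
    COLUMNS = {"date": (0, 80), "desc": (80, 360), "amount": (360, 450), "balance": (450, 540)}
    DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")
    MONEY_RE = re.compile(r"-?\d[\d,]*\.\d{2}")

    def parse_row(fragments):
        """fragments: (x, text) pairs belonging to one reconstructed line."""
        row = {name: [] for name in COLUMNS}
        for x, text in fragments:
            for name, (lo, hi) in COLUMNS.items():
                if lo <= x < hi:
                    row[name].append(text)
        row = {k: " ".join(v) for k, v in row.items()}
        # Sanity checks: reject lines that don't look like transactions.
        if not DATE_RE.fullmatch(row["date"]) or not MONEY_RE.fullmatch(row["amount"]):
            return None
        return row

    print(parse_row([(12, "01/02/2020"), (95, "COFFEE"), (120, "SHOP"), (400, "4.50"), (470, "1,234.56")]))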

~~~
Quarrelsome
Any tricks for decimal points versus noise? It's a terrifying outcome and all
I've got is doing statistical analysis on the data you've already got and
highlighting "outliers".

~~~
kevin_thibedeau
Change the decimal point in the font to something distinctive before
rasterizing.

------
Iwillgetby
If you upload a pdf to google drive and download it 10 minutes later it will
magically have BY FAR the best OCR results in the pdf. Note my pdf tests were
fairly clean so your experience may not be the same.

I have used Google's fine OCR results to simulate a hacker.

\- Download a youtube video that shows how to attack a server on the website
hackthebox.eu

\- Run ffmpeg to convert the video to images.

\- Run a jpeg to pdf tool.

\- Upload the pdf to google drive.

\- Download the pdf from google drive.

\- Grep for the command line identifiers "$" "#".

\- Connect to hackthebox.eu vpn.

\- Attack the same machine in the video.

~~~
DantesKite
Right? I love the OCR for Google Drive. It's such a useful, hidden feature.

By the way, why do you wait 10 minutes? Is there a signal that the PDF is done
processing?

Or is there just some kind of voodoo magic that seems to happen that just
takes 10 minutes to do?

~~~
Iwillgetby
2 minutes is probably long enough. I did notice that Google Drive doesn't seem
to like it if you upload a lot of files. I have had files sit and never get
OCR'd, but I forgot about them, so they may have been OCR'd by now.

Also, I am not aware of a signal when it is done.

~~~
Ididntdothis
You've got to love modern software. It may do it or not. It may do it within an
unknowable timeframe. But if it does it, it’s wonderful.

------
Wiretrip
PDF is, without a doubt, one of the worst file formats _ever_ produced and
should really be destroyed with fire... That said, as long as you think of PDF
as an _image format_ it's less soul destroying to deal with.

~~~
lm28469
PDF is good at what it's supposed to be good at. Parsing PDFs to extract data
is like using a rock as a hammer and a screw as a nail: if you try hard enough
it'll eventually work, but it was never intended to be used that way.

~~~
Finnucane
Actually, parsing text data from a pdf is more like using the rock to unscrew
a screw, in that it was not meant to be done that way at all. But yeah, the
pdf was designed to provide a fixed-format document that could be displayed or
printed with the same output regardless of the device used.

I'm not sure (I haven't thought about it a lot) that you could come up with a
format that duplicates that function and is also easier to parse or edit.

~~~
anoncake
It's closer to using a screwdriver to screw in a rock. The task isn't supposed
to be done in the first place but the tool is the least wrong one.

------
sixhobbits
As a meta point, it's really nice to see such a well-written, well-researched
article that is obviously used as a form of lead generation for the company,
and yet with no in-your-face "calls to action" which try to stop you reading
the article you came for and get your wallet out instead.

~~~
jiveturkey
i mean except for the banner at the top and bottom! but yeah, an SEO article
with actual substance, well formatted, not grey-on-grey[1], _no trackers_ [2],
is rare these days.

[1] recently read an SEO post on okta's site. who can read that garbage?

[2] only GA ... which isn't a 3rd-party tracker.

~~~
duckmysick
> GA ... which isn't a 3rd-party tracker.

Why not? It's not self-hosted and results are stored elsewhere.

~~~
jiveturkey
it doesn't correlate across sites by default -- the reasonable definition of a
3rd party tracker. by your definition, everything not completely self-hosted is
a 3rd-party tracker. eg, netlify, which uses server logs to "self"-analyze,
would be a 3rd party tracker. it is not self-hosted and the data is stored
elsewhere.

some might add: for the purpose of resale of the data, but I don't think
that's a requirement to be classified as 3rd party tracker. the mere act of
correlation, no matter what you then do with the data, makes you a 3rd party
tracker. in case you think that's just semantics, this is important for GDPR
and the new california law.

you can turn on the "doubleclick" option, which does do said correlation and
tracks you. but that's up to the site to decide. GA doesn't do it by default.

------
dwheeler
The best technique for having a PDF with extractable data is to include the
data within the PDF itself. That is what LibreOffice can do, it can slip in
the entire original document within a PDF. Since a compressed file is quite
small, the resulting files are not that much larger, and then you don't need
to fuss with OCR or anything else.
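
As a rough sketch of doing that programmatically: recent versions of pikepdf
expose an attachments mapping (API names here are from memory, so check the
docs for your version; file names are placeholders):

    from pathlib import Path
    import pikepdf  # attachments API below is from recent pikepdf (3.x+)

    # Ship the machine-readable source alongside the rendered pages, so
    # consumers never have to reverse-engineer the page content.
    pdf = pikepdf.open("report.pdf")
    pdf.attachments["source.csv"] = pikepdf.AttachedFileSpec.from_filepath(pdf, Path("source.csv"))
    pdf.save("report_with_source.pdf")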

~~~
wenc
Yes to embedding. In Canada, folks have always been able to e-file tax
returns, but the CRA (Canada Revenue Agency) also has a fillable PDF form for
folks who insist on mailing in their returns (with their receipts and stuff so
they don't have to store them and risk losing them).

When you're done filling the form, the PDF runs form validity checks and
generates a _2D barcode_ [1] -- which stores all your field entry data -- on
the first page. This 2D barcode can then be digitally extracted on the
receiving end with either a 2D barcode scanner or a computer algorithm. No
loss of fidelity.

Looks like Acrobat supports generation of QR, PDF417 and Data Matrix 2D
barcodes.[2]

[1] [https://www.canada.ca/en/revenue-
agency/services/tax/busines...](https://www.canada.ca/en/revenue-
agency/services/tax/businesses/topics/corporations/corporation-income-tax-
return/completing-your-corporation-income-tax-t2-return/2d-code.html)

[2] [https://helpx.adobe.com/acrobat/using/pdf-barcode-form-
field...](https://helpx.adobe.com/acrobat/using/pdf-barcode-form-fields.html)
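
Not the CRA's actual encoding, but the round-trip idea is easy to sketch with
the qrcode and pyzbar packages (my choices for illustration; the field names
and values are made up):

    import json
    import qrcode                     # pip install qrcode[pil]
    from pyzbar.pyzbar import decode  # pip install pyzbar (needs the zbar C library)
    from PIL import Image

    # Encode the validated field data once, at form-filling time...
    fields = {"line_101": 50000, "line_150": 48000}
    qrcode.make(json.dumps(fields)).save("page1_barcode.png")

    # ...and recover it losslessly on the receiving end, no text extraction needed.
    recovered = json.loads(decode(Image.open("page1_barcode.png"))[0].data)
    assert recovered == fields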

~~~
gruez
>for folks who insist on mailing in their returns (with their receipts and
stuff so they don't have to store them and risk losing them).

The Canadian tax agency offers free storage for whatever receipts you mail
them? Sounds nifty. Does the IRS (or any other tax agency) do this?

~~~
wenc
Just the receipts relevant to the tax return. If you e-file you're responsible
for storing receipts up to 6 years in case of audit. (or something like that)

------
bsdubernerd
It's nice to note how several of these problems already exist in much more
structured document types, such as HTML.

Using white-on-white dark-hat SEO techniques for keyword boosting? Check.
Custom fonts with random glyphs? Check. I didn't see custom encodings (yet).

We try to keep HTML semantic, but Google has been interpreting pages at a much
higher level in order to spot issues such as these. If you've ever tried to
work on a scraper, you know it's very hard to get far nowadays without using a
full-blown browser as a backend.

What worries me is that it's going to get massively worse. Despite me hating
HTML/web interfaces, one big advantage for me is that _everything_ which looks
like text is normally selectable, as opposed to a standard native widget which
isn't. It's just much more "usable", as a user, because everything you see can
be manipulated.

We've already seen asm.js-based dynamic text layout inspired by TeX, with
canvas rendering, that has no selectable content and/or suffers from all the
OP's issues! Now make it fast and popular with WASM...

"yay"

~~~
dredmorbius
Hiding page content unless rendered via JS is the darkest dark pattern in HTML
I've noted.

Though absolute-positioning of all text elements via CSS at some arbitrary
level (I've seen it by paragraph), such that source order has no relationship
to display order, is quite close.

------
ethanwillis
I went down a rabbit hole while making a canvas based UI library from
scratch.. and started reading about the history of NeWS, display postscript,
and postscript in general.

I started reading the ISO spec for the page description language used in
modern PDFs (a descendant of PostScript). You can read it yourself here:
[https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD...](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf)

What actually needs to be done to extract text correctly is to parse that page
description and have a way of figuring out how the raw text, or the curves
that draw the text, are displayed (whether they are at all, and in relation to
each other) using the information the PDF gives you.

Edit: More than anything I think understanding deeply the class of PDFs you
want to extract data from is the most important part. Trying to generalize it
is where the real difficulty comes from.. as in most things.

------
mkjmkumar
A couple of years ago I was working on a home project and used Tesseract and
Leptonica for OCR, with HDFS, HBase and SolrCloud for storage and search on
the extracted text. You can find the details on my website. I was very
impressed with the conversion of handwritten PDF docs, with 90% readable
accuracy. I have named it Content Data Store (CDS):
[http://ammozon.co.in/headtohead/?p=153](http://ammozon.co.in/headtohead/?p=153)
. The source code is open and you can find installation steps and how to run
it here:
[http://ammozon.co.in/headtohead/?p=129](http://ammozon.co.in/headtohead/?p=129)
[http://ammozon.co.in/headtohead/?p=126](http://ammozon.co.in/headtohead/?p=126)
A short demo
[http://ammozon.co.in/gif/ocr.gif](http://ammozon.co.in/gif/ocr.gif)

I did not get time to enhance it further but am planning to containerize the
whole application. See if you find it useful in its current form.

~~~
hylian
I had a similar problem and ended up using AWS' Textract tool to return the
text as well as bounding box data for each letter, then overlaid that on a UI
with an SVG of the original page, allowing the user to highlight handwritten
and typed text. I plan to open source it, so if anyone's interested let me
know.

Not a fan of the potential vendor lock-in though, so it's only really suitable
for those already in an AWS environment who aren't worried about them
harvesting their data.

~~~
rwojo
Very interested to see this as I was about to work on the same thing!

------
miki123211
I use a screen reader, so of course some kind of text extraction is how I read
PDFs all the time. There were some nice gotchas I've found.

* Polish ebooks, which usually use Watermarks instead of DRM, sometimes hide their watermarks in a weird way the screen reader doesn't detect. Imagine hearing "This copy belongs to address at example dot com: one one three a f six nine c c" at the end of every page. Of course the hex string is usually much longer, about 32 chars long or so.

* Some tests I had to take included automatically generated alt texts for their images. The alt text contained full paths to the JPG files on the designer's hard drive. For example, there was one exercise where we were supposed to identify a building. Normally, it would be completely inaccessible, but the alt was something like "C:\Documents and Settings\Aneczka\Confidential\tests 20xx\history\colosseum.jpg".

* My German textbook had a few conversations between authors or editors in random places. They weren't visible, but my screen reader still could read them. I guess they used the PDF or Indesign project files themselves as a dirty workaround for the lack of a chat / notetaking app, kind of like programmers sometimes do with comments. They probably thought they were the only ones that will ever read them. They were mostly right, as the file was meant for printing, and I was probably the only one who managed to get an electronic copy.

* Some big companies, mostly carriers, sometimes give you contract templates. They let you familiarize yourself with the terms before you decide to sign, in which case they ask you for all the necessary personal info and give you a real contract. Sometimes though, they're quite lazy, and the template contracts are actually real contracts. The personal data of people that they were meant for is all there, just visually covered, usually by making it white on white, or by putting a rectangle object that covers them. Of course, for a screen reader, this makes no difference, and the data are still there.

Similar issues happen on websites, mostly with cookie banners, which are
supposed to cover the whole site and make it impossible to use before closing.
However, for a screen reader, they sometimes appear at the very beginning or
end of the page, and interacting with the site is possible without even
realizing they're there.

------
tyingq
I almost always have to resort to a dedicated parser for that specific pdf. I
use it, for example, to injest invoice data from suppliers that won't send me
plain text. Always end up with a parser per supplier. And copious amounts of
sanity checking to notify me when they break/change the format.

------
saradhi
I'm an ML engineer and worked part time as a data engineering consultant for a
medical lines/claims extraction company for 3 years, which mainly involved
extracting tabular data from PDFs and images. Developing rules or parsers as
such is just no help; you end up creating a new rule every time an extraction
misses data.

With that in mind, and since the existing resources are little help,
especially with skewed, blurry, or handwritten input and pages with two
different table structures, I ended up creating an API service to extract
tabular data from images and PDFs, hosted at
[https://extracttable.com](https://extracttable.com) . We care about it being
robust; average extraction time on images is under 5 seconds. On top of
maintaining accuracy, a bad extraction is eligible for a credit refund, which
literally no other service offers.

I invite HN users to give it a try, and feel free to email
saradhi@extracttable.com for extra API credits for the trial.

~~~
jazzido
Hi, author and maintainer of Tabula
([https://github.com/tabulapdf/tabula](https://github.com/tabulapdf/tabula)).
We've been trying to contact you about the "Tabula Pro" version that you are
offering.

Feel free to reach me at manuel at jazzido dot com

~~~
staticautomatic
Edit: See reply below

Am I reading the repos correctly? It looks like Extractable copied Tabula
(MIT) to its own repo rather than forking it, removed the attribution, and
then tried to re-license it as Apache 2.0. If so, that would be pretty fucked
up.

[https://github.com/tabulapdf](https://github.com/tabulapdf)

[https://github.com/ExtractTable/tabulapro](https://github.com/ExtractTable/tabulapro)

~~~
jazzido
Not really. They import tabula-py, which is a Python wrapper around tabula-java
(the library of which I'm a maintainer).

Still, I would have loved at least a heads up from the team that sells Tabula
Pro. I know they're not required to do so, but hey, they're kinda piggybacking
on Tabula's "reputation".

~~~
wpietri
You're being much more polite here than I would be. Even if it isn't illegal,
what they've done is a giant dick move.

~~~
saradhi
William, the intention of "TabulaPro" is to give developers a chance to use a
single library instead of switching between ExtractTable for images and
tabula-py for text PDFs.

What do you recommend we do so you don't feel we made a dick move?

TIA

~~~
wpietri
Well, let me ask a few questions:

Did you ask permission of the original author to use a derived name?

Did you discuss your plan to commercialize the original author's work with the
author? Before starting out?

Since starting a commercial project, how much money have you given to the
original author?

~~~
saradhi
\- No, No, Zero.

"commercialize the original author's work with the author" \- No, but let me
highlight this, any extraction with tabula-py is not commercialized - you can
look into the wrapper too :) or even compare the results with tabula-py vs
tabulaPro.

Copying the TabulaPro description here, "TabulaPro is a layer on tabula-py
library to extract tables from Scan PDFs and Images." \- we respect every
effort of the contributors & author, never intended to plagiarize.

I understand the misinterpretation here is that, because of the name, it looks
like we are charging for the open-sourced library. We already informed the
author in the email about unpublishing the library; this morning I deleted the
project and came here to mention it is deleted :)

~~~
wpietri
Sorry, Saradhi, I don't think you can reasonably claim there was no intention
to plagiarize. Adding a "pro" to something is clearly meant to suggest it's
the paid version of something. And it's equally clear that "TabulaPro" is
derived from "Tabula".

It may be that you didn't realize that people would see your appropriation as
wrong, although I have a hard time believing that as well given that the
author tried to contact you and was ignored. As they say, "The wicked flee
when no man pursueth."

So what I see here is somebody knowingly doing something dodgy and then
panicking when getting caught. If you'd really like to make amends, I'd start
with some serious introspection on what you actually did, and an honest
conversation with the original author that hopefully includes a proper [1]
apology.

[1] Meaning it includes an explicit recognition of your error and the harms
done, a clear expression of regret, and a sincere offer to make amends. E.g.,
[https://greatergood.berkeley.edu/article/item/the_three_part...](https://greatergood.berkeley.edu/article/item/the_three_parts_of_an_effective_apology)

~~~
wpietri
And I'm going to add that it's really weird that your answer ("No, No, Zero")
is exactly the same as what the library author said [1] two hours before you
posted. But you repeat it without acknowledging the author, and with just
enough formatting difference that it's not a copy-paste. It's extremely hard
for me to imagine you didn't read what he said before writing that; it's just
too similar.

[1]
[https://news.ycombinator.com/item?id=22483334](https://news.ycombinator.com/item?id=22483334)

------
aasasd
What a glorious format for storing mankind's knowledge. Consider that by now
displays have arbitrary sizes and a variety of proportions, and that papers
are often never printed but only read from screens. To reflow text for
different screen sizes, you _need_ its ‘semantic’ structure.

And meanwhile if you say on HN that HTML should be used instead of PDF for
papers, people will jump on you insisting that they need PDF for precise
formatting of their papers—which mostly barely differ from Markdown by having
two columns and formulas. What exactly they need ‘precise formatting’ for, and
why it can't be solved with MathML and image fallback, they can't say.

People feeling the urge to defend PDF might want to pick up at this point in
the discussion:
[https://news.ycombinator.com/item?id=21454636](https://news.ycombinator.com/item?id=21454636)

~~~
alkonaut
A PDF isn’t for storage it’s for display. It’s the equivalent of a printout.
You don’t delete your CAD drawing or spreadsheet after printing it out.

~~~
aasasd
> _A PDF isn’t for storage it’s for display. It’s the equivalent of a
> printout._

This conjecture would have some practical relevance if I had access to the
same papers in other formats, preferably HTML. Yet I'm saddened time and again
to find that I don't.

In fact, producing HTML _or_ PDF from the same source was exactly my proposed
route before I was told that apparently TeX is only good for printing or PDFs.
I hope that this is false, but I'm not in a position to argue currently.

~~~
alkonaut
But when you access a paper it’s for reading it, correct?

It is worrying if places that are “libraries” of knowledge aren’t taking the
opportunity to keep searchable/parseable data, but it’s no worse than a
library of books.

~~~
aasasd
> _but it’s no worse than a library of books_

That's not my complaint in the first place. The problem is that while we
progressed beyond books on the device side in terms of even just the viewport,
we seemingly can't move past the letter-sized paged format. The format may be
a bit better than books—what with it being easily distributed and with
occasionally copyable text—but not enough so.

I'm not even touching the topic of info extraction here, since it's pretty
hard on its own and despite it also being better with HTML.

~~~
floriol
Yeah, it's better with HTML than with PDF, but it's still pretty terrible...
Use some actually structured data format like XML (XHTML would be good),
because you don't want to include a complete browser just to search for text

------
hgoury
There is a fairly interesting library developed by the Stanford Team behind
[https://www.snorkel.org/](https://www.snorkel.org/) that takes structured
documents, including PDF formatted as tables, and builds a knowledge base:
[https://github.com/HazyResearch/fonduer](https://github.com/HazyResearch/fonduer)

It looks promising for these kinds of daunting tasks

~~~
lwhsiao
One of the co-authors of Fonduer here. Just for reference the original paper
for Fonduer is here:

[https://dl.acm.org/doi/pdf/10.1145/3183713.3183729](https://dl.acm.org/doi/pdf/10.1145/3183713.3183729)

And additional follow-up work on extracting data from PDF datasheets is here:

[https://dl.acm.org/doi/pdf/10.1145/3316482.3326344](https://dl.acm.org/doi/pdf/10.1145/3316482.3326344)

One thing to point out about our library is that while we do take PDF as input
and use it to calculate visual features, we also rely on an HTML
representation of the PDF for structural cues. In our pipeline this is
typically done by using Adobe Acrobat to generate an HTML representation for
each input PDF.

~~~
bhl
What type of visual features are you looking at? I've been trying to find a
web-clipper that uses both visual and structural cues from the rendered page
and HTML, but have no luck finding a good starting point.

~~~
lwhsiao
There are a handful. We look at bounding boxes to featurize which spans are
visually aligned with other spans, which page a span is on, etc. You can see
more in the code at [1]. In general, visual features seem to give some nice
redundancy to some of the structural features of HTML, which helps when
dealing with an input as noisy as PDF.

[1]:
[https://github.com/HazyResearch/fonduer/tree/master/src/fond...](https://github.com/HazyResearch/fonduer/tree/master/src/fonduer/features/feature_libs)

------
p0nce
The take-away is that PDF should not be an input to anything.

~~~
mikestew
Except eyeballs and printers, and printers are just an eyeball abstraction.

~~~
hnick
You'd hope so, but some printers run some very finicky software with less
horsepower than your desktop machine so can fall over on complex PDF
structures. I preferred Postscript!

~~~
72deluxe
How does PCL cope with PDFs? Do you know if some conversion happens
beforehand?

~~~
hnick
Not sure if I'm misunderstanding, but PCL is another page description language
like PS; some printers can use both depending on the driver.

Most of our Xerox printers spoke Postscript natively, these days more printers
can use PDF. We generally used a tool to convert PCL to PS to suit our
workflow if that was the only option for the file, because being able to
manipulate the file (reordering and applying barcodes or minor text
modifications) was important. Likewise for AFP and other formats. PCL jobs
were rare so I never worked on them personally.

------
jatsign
The relatively small company I work for makes me fill out some forms by hand,
because they receive them from vendors as a PDF. So I print it out, sign it,
and return it to my company by hand.

If someone could make a service that lets you upload a PDF that contains a
form, and then let users fill out that form and e-sign it and collect the
results, and then print them out all at once, it would be great.

It's not a billion dollar idea but there are a lot of little companies that
would save a lot of time using it.

~~~
edent
I use Xournal -
[https://sourceforge.net/projects/xournal/](https://sourceforge.net/projects/xournal/)

It lets me type in to forms - or draw text over them if necessary. Then I
paste in a scan of my signature. Then save as a PDF and email it across.

I've been doing this for years. Job applications, mortgages, medical
questionnaires. No one has ever queried it.

If you're hand delivering a printed PDF, it's just going to be copy-typed by a
human into a computer. No need to make it too fancy.

~~~
moftz
I used Xournal for a couple years in college. It was perfect in how simple it
was to mix handwritten and typed notes or markup documents. The only thing is
that I wish it had some sort of notebook organization feature. It would have
been nice keeping all of my course notes in one file, broken down by chapter
or daily pages. Instead, I ended up with a bunch of individual xojs that did
the job but made searching for material take longer.

------
Savageman
> By looking at the content, understanding what it is talking about and
> knowing that vegetables are washed before chopping, we can determine that A
> C B D is the correct order. Determining this algorithmically is a difficult
> problem.

Sorry, this is a bit off-topic regarding PDF extraction, but it distracted me
greatly while reading...

I'm pretty sure the intention was A B C D (cut then wash). Not sure why the
author would not use alphabet order for the recipe...

[edit] Sorry, I showed it to a colleague and he mentioned the A B C D
annotations were probably not in the original document. This was not clear at
all to me while reading, and if they are not included it's indeed hard to
determine the correct paragraph order.

~~~
shawnz
Even if the ABCD was in the original document, how would the computer figure
out it's supposed to indicate the order?

And of course, even if the letters were there in the original document, it
would be clear to a human that they're incorrect because it doesn't make sense
to wash vegetables after cutting.

------
gnicholas
The article mentions various ways in which text that appears normal can
actually be screwed up inside a PDF. I have found this when running PDFs
through the BeeLine Reader PDF converter that my startup built.

One workaround I've found is that sometimes it helps to "print to PDF" the
original PDF using Preview on Mac. This doesn't fix all the problems, but it
does sometimes fix issues with the input PDF — even though both files appear
identical to the human eye.

Are there any other workarounds or "PDF cleaners" out there? It would be
awesome if there were a web-based service where you could get a PDF de-
gunkified, for lack of a better term.

------
the_french
Is there a tool that works for the limited subset of PDFs generated by Latex?
Do those documents have more structure than the average PDF? Less? It'd be
nice to extract text from scientific articles at least.

~~~
aglionby
I spent some time extracting abstracts from NLP papers (ACL conferences) and
it was mostly straightforward. Using pdfquery to extract PDF -> XML gave each
character as an element, and they were mostly ordered sensibly and grouped
into paragraphs.

However... this didn't work in some cases, mainly with formatted text but
sometimes with PDFs that looked like they were compiled in some nonstandard
way. As a result I ended up chucking the XML structure entirely and
recompiling the text from character-level coordinates. Formatted text was also
an issue, with slightly offset y coordinates from regular characters on the
same line.

I'm not sure I could take this experience and say that extracting _all text_
would be straightforward. Hopefully for most documents the XML is nicely
structured, but I imagine there are many more opportunities for
inconsistencies in how the PDF is generated when thinking about diagrams,
tables etc. rather than just abstracts.

Considered writing up a blog post about my experiences with the above but
imagined that it was far too niche. Code's here [1] if it's of interest.

[1]
[https://gist.github.com/GuyAglionby/4b55d00803710f2e2e9877fd...](https://gist.github.com/GuyAglionby/4b55d00803710f2e2e9877fd18b5a491)
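
For readers who just want the shape of that character-level regrouping, here
is a rough sketch of the same idea using pdfminer.six directly (which pdfquery
wraps); "paper.pdf" is a placeholder:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    def chars_in(layout):
        """Recursively collect every character with its coordinates,
        ignoring pdfminer's own line/paragraph grouping."""
        for element in layout:
            if isinstance(element, LTChar):
                yield element.x0, element.y0, element.get_text()
            elif hasattr(element, "__iter__"):
                yield from chars_in(element)

    for page_layout in extract_pages("paper.pdf"):
        # pdfminer's y axis grows upward, so sort by -y for top-down order.
        chars = sorted(chars_in(page_layout), key=lambda c: (-c[1], c[0]))
        print(len(chars), "characters on this page")
        # ...then group by y with a tolerance and re-insert spaces from x gaps,
        # which is where the per-corpus tuning lives.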

------
milesvp
FYI, redaction from pdf can be similarly difficult. I once was tangentially
involved with a pdf redaction piece of software, and due to many different
issues with pdf, the solution ended up being to create images of the input
pdf, draw over the redacted info, then create/overwrite a new pdf that was just a
container for the jpgs. It was the only way to be sure the info wasn’t in the
pdf at all, since it could be all kinds of places and duplicated in
interesting ways. But since you’d be working on the final rendering you could
be sure everything you covered would be covered in the final output. The
biggest challenges after that were related to text extraction, since we wanted
a nice UI where you could select text and the redaction would auto cover text
and use a uniform width based on the heights of all characters in the
redaction. I think, more often than we were happy with, a user would need to
simply use a bounding box since extracting all the pertinent data related to
the text was so hard.

I walked away from the product over a decade ago since it always seemed like
it’d be trivial for adobe to implement the feature in reader. Though every
couple of years there’s a redaction scandal and I keep wondering how lucrative
the product could have been with some marketing.

------
WalterBright
> our most successful solution was to run OCR on these pages.

That's the most interesting point in the article.

Reminds me of how a friend managed to fix bugs in an assembly source file
written in the original programmer's very own undocumented special language
implemented in the assembler's macro language. He disassembled the resulting
object file, fixed the problems, and checked in the disassembly as the new
source code.

------
UglyToad
The open source project I work on [0] returns the letters, their positions and
other associated information.

We provide support for retrieving words as well as a bunch of different
algorithms for document layout analysis [1]. But like the other commenters
here mention, it's an extremely difficult problem which doesn't have an easy
or general solution.

I was trying to build a custom library on top of the open-source library that
did a bit more processing, multi-column analysis, statistical analysis of
whitespace size, etc. But building something that works for the general case
is difficult enough to be functionally impossible.

Despite that I think the PDF format is well suited to what it is for and there
are very few "implementation mistakes" in the spec itself (no up-front length
for inline image data is the main one, plus accessibility obviously). It's
ultimately become too successful and as a result developers are stuck handling
cases where it's being used for entirely the wrong purpose but I can't see a
way to another format gaining purchase for the correct purpose (perhaps it's
like JavaScript in that way, it has huge adoption because it was first, not
because it does all jobs well).

Perhaps a content-first format which also handles presentation well could gain
a foothold if it came with a shim for PDF viewers and software to use but I
dread to think how much effort that would be.

[0]:[https://github.com/UglyToad/PdfPig](https://github.com/UglyToad/PdfPig)

[1]:[https://github.com/UglyToad/PdfPig/wiki/Document-Layout-
Anal...](https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis)

~~~
staticautomatic
I've also done a lot of work in this space and one thing I don't understand is
why more extraction libraries don't support images as input. If your PDF isn't
layered or OCR'd, it might as well be an image. I've lost count of the number
of times I've downloaded some PDF extraction tool and then had to hack it into
accepting an image.

------
garethl
The open-source Ghostscript [1] can convert simple PDFs to text, while keeping
the layout. I doubt it will handle some of the more complicated cases outlined
in the article though.

I use it quite successfully to turn my bank statements into text, which can
then be further processed.

[1]: [https://www.ghostscript.com/](https://www.ghostscript.com/)

~~~
Jaruzel
I've recently done this. Have scanned over 5,000 documents to PDF, then batch
converted those from PDF to TIFF using Ghostscript, and then Tesseract to OCR
the TIFF and combine both back into a searchable PDF. Tesseract may not be the
world's best OCR software but it's free and both it and Ghostscript are easy to
automate.

Now all I need is a good front end search system for my document archive.

~~~
72deluxe
How did you scan the documents to PDF? I use a Canon P-208 that has served me
well for many many years (long may it!) and the OCR on that works well.

Does the scanning system you use not do OCR?

I use a Mac and Spotlight does a good job of indexing the files. I think
alternatives for other OSes might be something like Apache Solr?

~~~
Jaruzel
I have a Brother ADS-2700w[1] as my scanner which is network connected. It
scans directly to a network share (SMB, but also supports FTP, nfs etc.) and
outputs as PDF. The PDFs are basically 'dumb' PDFs in that each page of the
PDF is an image all wrapped up inside the PDF container.

So that's where Ghostscript comes in. On a schedule I have a script that picks
up new PDFs in the share, runs them through Ghostscript to create a multipage
TIFF, that TIFF is then given to Tesseract (as it can't handle PDFs natively)
which does the OCR and outputs a nice PDF with searchable text. All very
simple.
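
Roughly what such a scheduled script can look like; the paths, 300 dpi and the
G4 TIFF device are placeholders to adjust, while the gs and tesseract
invocations are the standard ones:

    import subprocess
    from pathlib import Path

    INBOX = Path("/srv/scans/inbox")        # hypothetical share layout
    DONE = Path("/srv/scans/searchable")
    DONE.mkdir(parents=True, exist_ok=True)

    for pdf in INBOX.glob("*.pdf"):
        tiff = pdf.with_suffix(".tiff")
        # Rasterize the 'dumb' PDF to a 300 dpi multipage bilevel TIFF...
        subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=tiffg4", "-r300",
                        f"-sOutputFile={tiff}", str(pdf)], check=True)
        # ...then let Tesseract OCR it and emit a searchable PDF ("pdf" output mode).
        subprocess.run(["tesseract", str(tiff), str(DONE / pdf.stem), "pdf"], check=True)
        tiff.unlink()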

The scanning of the pages is very fast, but the scanner takes an _age_ sending
the PDFs over the network - its ethernet port is only 100mbit/s but to be
honest I just think the CPU inside the scanner is slow. It also doesn't have
enough internal buffer which means you can't scan the next document until the
previous one has completed being sent to the share.

If I hooked the scanner up to USB, then the PC could run the Brother software
which does use OCR - but it's not automatic, all it does is display the PDF
inside Paperport once the scan is complete. For bulk scanning, it's not
workable.

Regarding indexing - I've started looking at Solr, and it might suit my needs.
I was hoping for a visual type search system, where you could see thumbnails
of the PDFs in the results.

\---

[1]
[https://www.brother.co.uk/scanners/ads-2700w](https://www.brother.co.uk/scanners/ads-2700w)

------
hwc
I worked on PDF generating software for years. It's a horrible format that
should never have been approved as an ISO standard.

When in doubt, use plain text. It's a million times better in every way that
counts.

I wish my bank statements and such could be downloaded as plain text files,
instead of massive PDF files that embed another copy of a bunch of typefaces
in each file.

~~~
oddthink
Ugh, this. I still fail to understand how a device from 2019, even a phone,
could show any rendering delay when scrolling to page 200 of a 400 page static
document. I thought PDF was less programmable than PostScript, but there's
still got to be some kind of non-local semantics in there.

~~~
dragonwriter
> I thought PDF was less programmable than PostScript

It's not.

It used to be long ago, but now it has full programmability with JavaScript.

------
Santosh83
Another site that breaks the browser's back navigation. Why do so many sites
do this? Do they imagine they retain user attention for longer if they break
navigation? It's pretty trivial to long-press the back button or just close
the tab and not come back again to your site...

~~~
maest
Hi, author here.

We've taken no intentional action to change the way the back button works - in
fact, I too hate it when websites do that.

Can you PM me with some details about what you're seeing? I'm having issues
reproducing it with my particular setup.

~~~
Santosh83
Good to know! I don't believe PM is possible on hacker news so I hope you
don't mind that I describe some details right here?

My browser is the latest (v73.0.1) Firefox on the latest build of Windows 10.
I confirmed the issue with all addons disabled, so it is not an addon issue. I
think I know what may be responsible. When I initially load the page, the back
button works as intended for about a second. After that delay the page seems
to load some resources from static.parastorage.com and www.mymobileapp.online.
Once those resources have finished loading, the back button does not navigate
back to the HN article on the first press; you have to press once more. So I
presume a script from one of those domains is responsible. Hope this helps!

------
andrewshadura
I had to go through a fair bit of this when writing my Android receipt printer
driver. Parse a PDF print job, detect tables, basic formatting, align text to
grid, reformat for 58 mm paper roll width… and that's when the fun begins,
since every ESC/POS printer makers supports a different dialect or a different
character encoding set, or maybe just one, or maybe there are certain quirks
you have to account for…

I should probably write a blog post on this.

[https://salsa.debian.org/andrewsh/escpos-
android](https://salsa.debian.org/andrewsh/escpos-android)

------
nl
A shout-out for PDF Plumber:
[https://github.com/jsvine/pdfplumber](https://github.com/jsvine/pdfplumber)

I've done _lots_ of work in this space, including computer vision and ML
approaches, and Tabula[1] which was the gold standard for extraction.

PDF Plumber is better on just about every example I've tried.

[1] [https://tabula.technology/](https://tabula.technology/)
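
For anyone who hasn't tried it, the core API is pleasantly small (file name
made up):

    import pdfplumber

    with pdfplumber.open("statement.pdf") as pdf:
        page = pdf.pages[0]
        print(page.extract_text())            # words reassembled from positioned characters
        for table in page.extract_tables():   # each table is a list of rows of cell strings
            print(table)
        # page.chars is also there if you want the raw per-character x/y data.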

------
kbouck
On a personal project, I had a good experience extracting PDF text using
Tabula[1]. You specify the bounding boxes where desired data is, and it spits
out the content it finds.

It still hits the issues mentioned in this article (surprise spaces appearing
in middle of words, etc)

[1] [https://tabula.technology/](https://tabula.technology/)
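
In code form with tabula-py that looks roughly like this; the numbers are
invented, and if I remember the option right, area is (top, left, bottom,
right) in PDF points:

    import tabula  # tabula-py, the Python wrapper around tabula-java (needs a JVM)

    # Bounding box measured once per layout, e.g. with the Tabula GUI.
    tables = tabula.read_pdf("statement.pdf", pages="1", area=[150, 40, 700, 560])
    print(tables[0])  # one pandas DataFrame per detected table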

~~~
squaresmile
There's also camelot in Python [1]. Discovered it on HN [2]. Still a decent
amount of manual work afterwards, though it's probably unreasonable to expect
otherwise.

[1] [https://camelot-py.readthedocs.io/en/master/](https://camelot-
py.readthedocs.io/en/master/)

[2]
[https://news.ycombinator.com/item?id=18199708](https://news.ycombinator.com/item?id=18199708)

~~~
peterburkimsher
I've had a good experience with Camelot extracting table-based data!

------
anodyne33
Does anybody regularly use Acrobat's text extraction engine? I've had fine
results as far as accuracy goes when compared to other OCR engines but one
sticking point drives me nuts. My problem is, and I'm typically doing this in
batches of thousands of files, if a PDF has a footer applied Acrobat sees that
as renderable text and blows off the rest of the page. I've tried
all manner of sanitizing, removing hidden information, saving as another PDF
protocol and still can't get around the plain text footers/headers. In a
perfect world I'd have unlimited Tesseract or ABBYY access but we're trying to
do this on the cheap and I'm working with client data that I don't want to
bang through Google. I'll have to poke at some of the open source tools
mentioned so far, too.

~~~
philipkglass
14 years ago I used the personal edition of Abbyy FineReader to OCR about
400,000 scanned journal articles. It took me a few months.

The workflow was:

\- Extract the page images as TIFF, and store the page ranges so I could map
the page ranges back to the individual articles afterward.

\- Concatenate a range of images into one big file, with an upper limit of (IIRC)
about 4000 pages. FR would start to generate weird errors when I made the
files any bigger than this.

\- Run OCR over the giant 4000 page file.

\- Export the result as one big PDF with an OCR text layer under the scanned
pages.

\- Split the PDF back into individual PDF files corresponding to articles,
using the data I saved in step 1.

\- Optimize the individual PDF article files for compact storage, using the
Multivalent [1] optimizer.

I did this with a combination of FineReader -- the only paid software --
Python, Multivalent, AutoHotKey, and PDFtk.

I was living on a grad student stipend at the time so I optimized for spending
the least amount of cash possible, at the cost of writing my own automation to
replace the batch processing found in more expensive editions of FineReader.

The most time consuming part was dealing with weird one-off errors thrown by
FR's OCR engine. I had to resolve them all manually. They were too varied and
infrequent to be worth automating away.

I tried Acrobat's own OCR too before I resorted to FineReader, but it was
pretty terrible. At the time it also appeared to make the PDF files
significantly larger, which was weird since a text layer shouldn't take much
additional storage.

[1] [http://multivalent.sourceforge.net/](http://multivalent.sourceforge.net/)

------
BashiBazouk
It's interesting to see other views of PDF. As someone who lives in
Illustrator ripping every little piece of data out of a pdf to import into an
Illustrator or InDesign file and then making a production pdf for large format
printing and fixing plenty of issues along the way I find the text almost
inconsequential to the whole thing. It's just another element among many
elements: images, vector illustrations, etc. PDF might not be the best way to pass
along pure text but as a container for graphical representation it works
pretty well. I build pdf files to describe 20 ft walls with 1+ gigabyte
images, complex vector illustrations, finely formatted text and it all prints
out damn close to how I planned for it, down to exact colors that match specific
Pantone swatches. It's amazing what can be packed in to a pdf...

~~~
ogurechny
You are probably working directly with native formats embedded in PDF without
even processing the visualized elements. Adobe tools like to do that.

Sometimes, publishers make their PDF e-books from printed source in which
images are “optimized” to low quality JPEGs, but next to them non-display
Photoshop data streams with pristine megapixel illustrations are kept. If you
catch big PDF files, check their insides, it's one line of `mupdf extract`.

------
voicesarefree
Wish I had this to share with my boss years ago. My first big project at my
first post-college job was building a PDF parser that would generate
notifications if a process document had been updated and it was the first time
the logged in user was seeing it (to ensure they read the changelog of the
process). Even with a single source of the PDFs (one technical document
writer) I could only get a 70% success rate because the text I needed to parse
was all over the place, when I stated we would need to use OCR to get better
results no further development was done (ROI reasons). The technical writer
was unwilling to standardize more than they already had, or consider an
alternative upload process where they confirm the revision information.. which
didn't help.

I don't envy working on ingesting even more diverse PDFs.

------
jonathankoren
While it’s built specifically for PDF tables, I’ve found Tabula to be pretty
robust. I use it to convert banking statements to CSV.

It’s still a pretty manual process, but it does the most difficult part well
enough.

[http://tabula.technology/](http://tabula.technology/)

------
rotrux
This piece is extremely readable. Kudos to the author.

------
siftrics
I’m the founder of a startup ([https://siftrics.com](https://siftrics.com))
that’s trying to solve this problem. We’re growing extremely quickly.

Way too busy to write more, but I’ll be back to read comments later tonight.

------
mark-r
Not the only format Adobe made overly complex.

Wonderfully amusing code comment, about Adobe PSD format:
[https://fallenpegasus.livejournal.com/854615.html](https://fallenpegasus.livejournal.com/854615.html)

~~~
wpietri
Are there 30- to 40-year-old application formats that you think have done a
better job adapting to new needs and 4 to 6 orders of magnitude improvements
in the systems they run on?

~~~
mark-r
TIFF did a wonderful job of being forward thinking. It has been the base for a
number of other image file formats.

~~~
wpietri
I think TIFF has a number of advantages there. It was from the beginning an
interchange format, so it had the opportunity to look at a bunch of existing
formats and extract the commonality. It's also not an application format, so
the pace of change is slower and more controlled; it can trail rather than
lead. And it is of course a standard, which means a different set of dynamics
around how things get added and how clear the specs have to be.

That's not to say it isn't great; I could well believe it. But I'm just not
shocked that PSD and PostScript have ended up being a bit of a mess over the
decades. I doubt I could have done any better.

------
Ididntdothis
Since the title is a question my answer is “pretty much everything”.

------
whatisthetruth
I saw this and thought it might be something my company could use so I
contacted them for info.

The CEO Simon Mahony basically told me to piss off when I told him I thought
the site was misleading since there was no product or service to directly
purchase. They make custom developed software that you must pay their
consultants to integrate. I would not do business with such a company that
acts so unprofessionally even if they have a decent team.

------
gfxgirl
The one that hits me all the time is trying to reference the OpenGL and OpenGL
ES spec pdfs. The last numbered section of the specs contains state tables in
landscape layout vs the rest of the spec in portrait layout. Neither Chrome's
nor Firefox's reader searches the text in these tables, which I need to
reference often.

The fact that text might be oriented differently wasn't covered in the article.
IIRC Preview on Mac might search there (not near my Mac ATM to check).

------
camillovisini
For academic papers: GROBID [0] is a machine learning library for extracting,
parsing and re-structuring raw documents such as PDF into structured XML/TEI
encoded documents with a particular focus on technical and scientific
publications.

[0] [https://github.com/kermitt2/grobid/](https://github.com/kermitt2/grobid/)
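
To give a sense of how it's used: GROBID runs as a service and you POST PDFs
to it. The port and endpoint below are its documented defaults as I recall
them; "paper.pdf" is a placeholder:

    import requests

    with open("paper.pdf", "rb") as f:
        r = requests.post("http://localhost:8070/api/processFulltextDocument",
                          files={"input": f})
    r.raise_for_status()
    print(r.text[:500])  # TEI XML with title, authors, sections, references, ...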

------
peterburkimsher
I had problems with copy-pasting Chinese text from PDFs before. The characters
would come out as Kangxi radicals, rather than Traditional Chinese characters.
They look the same, but are different code points!

[https://pingtype.github.io/docs/docs.html#translateButtons](https://pingtype.github.io/docs/docs.html#translateButtons)
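
For anyone hitting this: the Kangxi radicals carry compatibility
decompositions, so NFKC normalization maps them back to the usual CJK code
points after the fact (a post-hoc fix for the pasted text, not for the
extraction itself):

    import unicodedata

    pasted = "\u2F00"                      # KANGXI RADICAL ONE, what copy-paste produced
    fixed = unicodedata.normalize("NFKC", pasted)
    print(pasted == fixed)                 # False: they only *look* the same
    print(hex(ord(fixed)))                 # 0x4e00, the CJK unified ideograph for "one"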

------
thayne
I've worked on the other end of this: trying to make it easy to extract text
from PDFs that we generated. Turns out that is pretty hard too. There just
isn't a good way to include metadata about how text flows. So columns,
callouts, captions, etc. all cause problems. The PDF format just wasn't
designed for text extraction.

------
gbtw
I had great success with Tika, and just OCR'd when finding too many high
Unicode vs. normal characters.

------
bhanhfo
On the other hand... OCR is by now so good that it can be used for many PDF
text extraction projects. So often there is no longer a need to bother with
PDF internals; just screenshot the PDF document and parse it. A free PDF OCR
service is, for example, ocr.space.

------
arsome
I'm still unclear on why SumatraPDF bothered implementing this anti-user copy
prevention feature, it's really annoying to have to actually break out
separate tooling to strip the flags on datasheets and schematics.

------
Abimelex
I used to work on PDF extraction during my bachelor thesis analyzing German
law texts. The most fun part here was that the text came shipped in two
columns. Sometimes the extraction worked in the correct order, sometimes two
lines from two columns were recognized as one line. In the end I implemented
something like this algorithm, see chapter 4.4 here:
[https://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/final.pd...](https://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/final.pdf)

------
dredmorbius
I've been wrestling with a similar set of tasks, and have arrived at a similar
set of tools and options.

How you process PDF depends greatly on the _scale_ at which you're working
with documents. For large-volume, high-speed processing, automation is
necessary. Where you're translating a more stable corpus, human input may be
tractable. The ability to look at source PDF, OCR, and an edited text version
to correct for errors seems a part of that workflow.

Often it's possible to get _close_ or _approximate_ transcription using
standard tools. I've found the Poppler library's "pdftotext" remarkably good
with many PDFs, so long as there's some text within them:
[https://poppler.freedesktop.org](https://poppler.freedesktop.org)

There's a general concept I've been working toward of a minimum sufficient
document complexity, which follows a rough (though not strict) hierarchy. It's
remarkable how much online content is little more than paragraph-separated
text, _with no further structure_. Even images are not strictly informational,
but rather window-dressing.

Typically, additional elements added are hyperlinks, images, text emphasis
(italic and bold, often only the first), sections, lists, blockquotes, super-
and sub-script, in roughly that order.

(A study looking at the prevalence of specific semantic HTML elements within a
corpus would be ... interesting.)

Then there are the elements _NOT_ natively supported in HTML: equations,
endnotes/footnotes, tables of contents, etc.

It seems to me there should be an analogue to Kolmogorov complexity as concerns
layout of textual documents. That is: there is a minimum _necessary_ and
_sufficient_ level of markup (perhaps: number, type, and relationship of
elements) required to lay out a specific work.

I've tagged out novel-length books in Markdown with little more than the
occasional italic and chapter marks.

Documents which use _more_ markup than is required are overspecified. This is
the underlying problem with a great deal of layout, and the ability to reduce
texts to their minimum complexity would be useful. It's a nontrivial problem,
though large swathes of it should be reasonably achievable.

Another approach would be for information-exchange formats to actually be, you
know, _information exchange formats_ rather than PDF.

(Though the latter is often, though not always, well-suited to reading.)

------
lukepdf
We're working on a number of fun problems like this over at PDFTron in
Vancouver! Currently growing and looking for software devs in a few different
areas. ltully(at)pdftron.com

------
ogurechny
Not exactly an industrial solution, but for the common types of text documents
used by common people (i.e. books), k2pdfopt performs a lot of that magic under
the hood.

------
martingoodson
I wouldn't bother with parsing the pdf. Directly reading from pixels can be
more accurate than the parsed output, but will require some R&D. You'll need
very high recall text detection and an accurate algorithm for OCR. And _a lot_
of real documents as training data.

It's critical that the training data is good quality and much of the
engineering effort should go into good annotation interfaces. We built an end-
to-end system for all this at evolution.ai. Please email me if interested in
an off-the-shelf solution. martin@evolution.ai.

~~~
floriol
As he wrote in the article, OCR is an order of magnitude slower.

------
ashishb
I faced many of the same issues while working on a side project
(decksaver.ashishb.net) and eventually went for an OCR approach to extract text.

------
1-6
Then what are some good options out there to keep text machine-centric while
offering aesthetic flexibility? Is markdown with CSS a thing?

~~~
BlueTemplar
What's wrong with HTML?

------
kabacha
Why don't we just get rid of PDF entirely? Clearly it's a flawed format.

Which brings me to a question: what alternatives do we have to PDF?

------
pierre
That's one of the reasons we built http://par.sr, to try to get as clean data
as possible from PDFs!

------
amelius
I suppose the best approach is to combine OCR techniques while taking hints
from the PDF structure.

~~~
wyattpeak
An order of magnitude increase of time is very significant. If you're just
processing a few documents with a lot of human oversight you may be right, but
it's definitely not a generalised best approach, at least going by the
article.

------
mixmastamyk
Glad it is, I don’t care for job sites trying to parse my resume.

~~~
72deluxe
Too true! That's probably why all of the agencies near me wanted .doc files so
they could scrape them and remove my address to insert themselves as middlemen
with the aim of holding both employer and prospective employee hostage to
their bounties.

------
jorgenveisdal
So true

------
animalnewbie
I want to read the article, but it's pointless because PDF doesn't handle
anything but ASCII chars well. Add some Asian languages and there is no way to
get that text back.

PDF needs to die. Djvu is good.

~~~
floriol
What do you mean? As I understand it, it depends on the font - you can provide
any sort of encoding. So Unicode is there; I don't see how that would be
harder than with the Latin alphabet (which is still a hard problem as per the
article).

------
peteretep
So all through this I’m thinking “just OCR it and be done”, and we get to:

> Why not OCR all the time? > Running OCR on a PDF scan usually takes at least
> an order of magnitude longer than extracting the text directly from the PDF.

... so? Google can OCR video and translate it in something that feels like
real-time; what PDF processing are they doing that is so performance bound?

> Difficulties with non-standard characters and glyphs OCR algorithms have a
> hard time dealing with novel characters, such as smiley faces,
> stars/circles/squares (used in bullet point lists), superscripts, complex
> mathematical symbols etc.

Sure, but more than the random shit you find in PDFs anyway?

> Extracting text from images offers no such hints

Finding an algorithm that approximates how a human approaches a page layout
doesn’t feel like it would be all that hard.

Obviously it’s very easy to stand on the sidelines and throw stones, but
parsing PDFs using anything other than OCR + some machine learning models to
work out what type a piece of text is feels like pretending we are still
constrained by the processing costs of 5 years ago.

~~~
jrandm
> Finding an algorithm that approximates how a human approaches a page layout
> doesn’t feel like it would be all that hard.

"In CS, it can be hard to explain the difference between the easy and the
virtually impossible."

[https://xkcd.com/1425/](https://xkcd.com/1425/)

~~~
Wiretrip
Totally agree. I worked on a project that had to try and extract tables from
PDFs. It is _much harder_ that it would first appear.

~~~
ldenoue
Detecting where tables are is still an active research area. Once we know
where they are on the page, it's easier to parse out their structure.
