
PDF widely misunderstood by developers - abennett
http://www.itworld.com/development/76117/pdf-widely-misunderstood
======
wglb
Good quick read. Worth it for the line "See these telephone numbers? I need
data like that. I don't care how you get it; I'm just showing you this
particular representation, because you're a programmer, and we rarely
understand each other."

~~~
patio11
If I ever invent a time machine, I want to go back to my first year at the day
job and say "When the boss says he wants a 'web service', he really just needs
data from page X displayed on all these sidebars. Use an iframe and you won't
be stuck in the office doing overtime for the next 3 months fighting Java's
ridiculously obtuse frameworks."

~~~
yarone
If I had a time machine, I'd do something much cooler.

~~~
jodrellblank
With a time machine, you could be doing that and something much cooler,
simultaneously.

------
tvon
"PDF widely misunderstood by non-developers" seems to fit better.

~~~
absconditus
The example given in this subpar submission is not unique to PDF. The problem
would not be much easier if the resumes were in Word, RTF or even plain text.
There are numerous tools to extract text from PDF documents. The hard part of
this problem is finding discrete data in text that isn't in a standardized
format.
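
To illustrate the point about non-standardized text, here is a minimal sketch of pulling phone numbers out of text that has already been extracted. The pattern and the name `find_phone_numbers` are assumptions for illustration; it only covers common North American layouts, which is exactly why this kind of extraction is fragile.

```python
import re

# Hypothetical pattern: covers a few North American layouts only.
# Real resumes will contain formats this does not anticipate.
PHONE_RE = re.compile(
    r"""
    (?<!\d)                # not preceded by another digit
    (?:\+?1[\s.-]?)?       # optional country code
    (?:\(\d{3}\)|\d{3})    # area code, with or without parentheses
    [\s.-]?                # optional separator
    \d{3}                  # exchange
    [\s.-]?                # optional separator
    \d{4}                  # subscriber number
    (?!\d)                 # not followed by another digit
    """,
    re.VERBOSE,
)


def find_phone_numbers(text):
    """Return every substring of text that looks like a phone number."""
    return [match.group(0) for match in PHONE_RE.finditer(text)]
```

Every new layout (extensions, international numbers, numbers split across lines) needs another special case, regardless of whether the text came from PDF, Word or plain text.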

It is difficult to determine whether the author is referring to PDF as an
image format or if he actually means that the resume is an image in the PDF
file (as opposed to text created through OCR). If it is the latter, it is
again not a PDF problem. JPEG images would not make the problem less
difficult.

I also don't like the attitude that non-technical people should know what is
and isn't possible. That is our job. Many technical people also claim things
are "impossible" when they simply don't want to do them. If a technical person
spends weeks on this before realizing that it is an extremely difficult
problem, then they are incompetent.

~~~
fauigerzigerk
_The problem would not be much easier if the resumes were in Word, RTF or even
plain text. There are numerous tools to extract text from PDF documents._

That is not entirely true. PDF can be (and is frequently) generated in a way
that doesn't even allow you to extract the sequence of words
deterministically. A text, Word or RTF file always makes this possible (the
pathological case of text embedded in images notwithstanding).

There are tools to extract text from PDF, but all of them have to rely on more
or less reliable heuristics to recover the original order of words and
letters, unless the PDF file was generated with particular settings that
appear to be non-default in many tools.
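
A rough sketch of the kind of heuristic I mean, assuming the extractor hands you text fragments as hypothetical `(x, y, text)` tuples with `y` measured downward from the top of the page (this is not the output format of any particular tool). Fragments whose `y` values are close are treated as one line, and each line is read left to right.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Guess the reading order of positioned text fragments.

    fragments: iterable of (x, y, text), y measured downward from the page top.
    line_tolerance: assumed maximum vertical distance from the first fragment
    of a line for another fragment to count as part of the same line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order fragments by horizontal position.
    return [" ".join(text for _, text in sorted(words)) for _, words in lines]
```

The tolerance is a guess, and it breaks down on superscripts, rotated text and multi-column pages, which is why the results are only more or less reliable.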

------
Semiapies
The fun comes in, of course, when the PDF really _is_ the only available data
source (which, thankfully, has been rare for me). Then you just have to hope
you're dealing with standardized forms, or else you're in for some grief.

~~~
kenver
We created a digital archive of about 80 years' worth of magazine issues that
were in all sorts of formats. The OCR worked pretty well in most cases, but we
found that the older stuff worked better due to there being limited typography
and simpler layouts 80 years ago.

The more modern issues were only available in PDF and were the biggest
challenge, but the OCR still did a reasonably good job, even with the complex
layouts and fonts. The tricky bit was preserving the flow of the document,
i.e. where to go when a column ends.
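
For a sense of the column problem, here is a simplified sketch (not what we actually ran). It assumes text blocks arrive as hypothetical `(x, y, text)` tuples, with `x` the left edge and `y` measured from the top of the page, groups blocks into columns by left edge, and reads each column top to bottom before moving right.

```python
def column_order(blocks, column_gap=50.0):
    """Order text blocks column by column.

    blocks: iterable of (x, y, text); x is the left edge, y is measured
    downward from the page top.
    column_gap: assumed minimum horizontal distance between the left edges
    of two different columns.
    """
    columns = []  # each entry: [left_edge, [(y, text), ...]]
    for x, y, text in sorted(blocks):
        if columns and x - columns[-1][0] < column_gap:
            columns[-1][1].append((y, text))
        else:
            columns.append([x, [(y, text)]])
    # Read each column top to bottom, columns left to right.
    return [text for _, items in columns for _, text in sorted(items)]
```

Pull quotes, images with wrapped text and headlines spanning several columns all defeat a fixed gap like this.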

~~~
Semiapies
Extracting fields as opposed to documents is a different kettle of fish, I
think.

------
loumf
If you have PDFs that are scans of resumes (as in his example), then PDF text
extraction is the least of your problems. It's actually extremely useful to
automatically generate an index of the words in resumes if you have a lot of
them, but you'll need OCR to do it.
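
Once the text is out (via OCR or otherwise), the index itself is the easy part. A minimal sketch, assuming a hypothetical mapping from document id to already-extracted text:

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document ids containing it.

    documents: mapping of document id -> text that has already been
    extracted or OCR'd (the extraction step is not shown here).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index
```

Looking up `index["java"]` then gives the set of resumes that mention the word, with no attempt at pulling out structured fields.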

------
dugmartin
I'd recommend PDF TextStream for anyone who needs to pull data from PDFs -
<http://snowtide.com/PDFTextStream>

~~~
krisneuharth
Recently I tried PyPDF (<http://pybrary.net/pyPdf/>) and was not happy with
the results. Has anyone used an open source tool they liked for parsing PDF,
namely for extracting the text?

------
tsestrich
As a soon-to-be graduating college student, this just makes me more and more
sad that my tastefully yet interestingly laid-out resume will be reduced to a
set of data that I could have spent 5 minutes putting into an email or a web
form. I guess it's nice to hand out at a job fair though...

