
Show HN: PDFLayoutTextStripper – Converts PDF to text while keeping the layout - jlink
https://github.com/JonathanLink/PDFLayoutTextStripper
======
nemild
For those interested in converting PDF tables into CSV, there's also Tabula (
[http://tabula.technology/](http://tabula.technology/) )

(Used by many journalists to analyze the data in PDFs)

~~~
scrollaway
I find it absolutely ridiculous that we have to resort to these kinds of tools
:/

We have digital formats, and we decided to standardize document distribution
on the one that makes it as hard to extract data as if it were on physical
paper.

~~~
halomru
PDF is a perfectly fine and rich digital format. It also allows you to do
proper copy and paste, which is much saner than anything paper offers.

Sure, PDF is light on context clues for automation and is targeted purely at
humans. But formats targeted at both computers and humans consistently fail
(XML with accompanying XSLT comes to mind), and/or only have terrible tools
for creating files (easily parsable, pretty HTML).

Either there is very little real demand or we consistently fail at making
alternatives viable.

~~~
TeMPOraL
> _It also allows you to do proper copy and paste, which is much saner than
> anything paper offers._

Technically it does that, yes. Rarely do I see people taking advantage of it,
though; most of the times I tried to copy some text out of a PDF, the result had
to undergo a significant cleanup before becoming usable.

~~~
saghm
I've had similar experiences, as well as worse; one PDF I got from a bank
about my student loans a year or so back had ostensibly only text content, but
none of it could even be selected.

------
rsync
This is important for (al)pine users ... when reading email in a terminal it
is very useful to be able to open a PDF attachment as text and view it in the
(terminal) mailtool ...

Yes, (al)pine is my mailtool in 2017.

~~~
bsharitt
Also mutt, which I've switched back to recently. I've got a little Atom
powered Chromebook converted to Linux that just does not like modern heavy
webmail clients (even GMail when it was still running ChromeOS, and this is one
still on the market, Acer CB3-131) so a combination of mutt, mbsync, and msmtp
is a much nicer combo. Mutt is a terrific mail reader but its internal SMTP
and IMAP handling can be a bit iffy, hence mbsync and msmtp.

Though I can generally open attachments just fine, this text rendering of PDFs
would be useful for when I'm SSH'd into my home machine and reading stuff
remotely (usually from work where I don't want to download my personal email).

~~~
JetSpiegel
Mutt + mbsync + msmtp is my setup too. I'm using Neomutt, since that's being
actively maintained by a sizeable community of friendly people.

------
tyingq
Curious if this works better than the pdftotext utility that comes in the
Debian poppler-utils package.

That has a -layout option that works really well sometimes and really
terribly other times. Doesn't seem to be related to document complexity
either.

~~~
jlink
During development I compared my results with those of the pdftotext
utility and I obtained more or less similar results. The objective of my code
was to have an equivalent tool easily embeddable in any Java/Android project
and to learn more about Apache PDFBox.

~~~
tyingq
I imagine it's not an easy task guessing about proportionally spaced fonts,
overlapping bounding boxes, columns, tables, wrapping, and so forth.
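
To make the difficulty concrete, here is a minimal sketch of the character-grid idea. It is not the code of PDFLayoutTextStripper or pdftotext. It assumes the text fragments already come with page coordinates and that every character has the same width, which is exactly the assumption proportionally spaced fonts break. The names `Fragment`, `render_layout`, `char_width`, and `line_height` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text plus the page position of its top-left corner (hypothetical type)."""
    x: float
    y: float
    text: str


def render_layout(fragments, char_width=6.0, line_height=12.0):
    """Place fragments on a character grid to approximate the page layout.

    Assumes a fixed character width, so proportional fonts will drift.
    """
    rows = {}
    for frag in fragments:
        row = round(frag.y / line_height)
        col = round(frag.x / char_width)
        rows.setdefault(row, []).append((col, frag.text))

    lines = []
    for row in range(max(rows, default=-1) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            # Pad up to the target column; if the previous fragment already
            # overran it, fall back to a single separating space.
            if len(line) < col:
                line = line.ljust(col)
            elif line:
                line += " "
            line += text
        lines.append(line)
    return "\n".join(lines)
```

Because each fragment is placed by its own x position, a table row with an empty cell in the middle keeps its later columns aligned. Overlapping boxes only get the crude single-space fallback, which is one place a real tool has to guess.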

~~~
jlink
Yes, definitely not easy, but fortunately PDFBox offers a solid base to start
with.

------
WalterGR
Fairly frequently, OCR engines are posted here. But almost without exception,
they lack layout analysis, which renders them largely useless.

Is this something that could be combined with those OCR engines? (e.g.
TesseractOCR...)

~~~
RandomBookmarks
I would not call these services useless ;) - but I wonder the same... Some
APIs like [https://ocr.space](https://ocr.space) return the coordinates of
each converted word. Can that be used as input? (I have not tried it yet)
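
As a rough sketch of what using word coordinates as input could look like: group the word boxes into lines by their vertical position, then order each line left to right. The dictionary keys `text`, `left`, and `top` are hypothetical and are not taken from any particular OCR API's response format; `y_tolerance` is an assumed pixel threshold.

```python
def group_words_into_lines(words, y_tolerance=5):
    """Cluster word boxes into text lines by their top coordinate.

    `words` is a list of dicts with hypothetical keys "text", "left", "top".
    """
    lines = []  # each entry: [reference_top, [member words]]
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if lines and abs(word["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word["top"], [word]])
    return [
        " ".join(w["text"] for w in sorted(members, key=lambda w: w["left"]))
        for _, members in lines
    ]
```

This only recovers reading order within a line; skewed scans or multi-column pages would need more than a fixed tolerance.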

~~~
PretzelFisch
Ephesoft seems to use this for classifying documents and extracting data from
them.

------
jlink
Who would be interested in a website doing the job?

~~~
logicallee
If you really want to rake it in, serve text versions of the top 10,000 web
sites at static speeds (meaning instantly: I swear, boot a ramdrive (tmpfs)
and serve static HTML from nginx, all from RAM). There is so much crap on most
sites. Re-crawl hourly.

Monetize via Google AdWords.

EDIT: I'm not sure why I'm being downvoted. I am not suggesting serving PDFs.
I am suggesting serving tiny text renders of top sites that are otherwise
much too bloated.

The hard part is getting the text and layout right. Many people read many
sites for the text IMO.

So I am suggesting you make an all-text version.

As an example, the front page of the New York Times right now, copied into
Microsoft Word, is 2504 words. When I save what I copied into Word as .txt,
I get a 16.4 KB file.

By comparison, when I put the site into a Page Size Checker --
[http://smallseotools.com/website-page-size-checker/](http://smallseotools.com/website-page-size-checker/)
-- I get 214.23 KB. That is impressively small, and it's a fast page.

If I try their competition, the Washington Post, I get 237 KB. If I try the
Wall Street Journal, I get 938.15 KB -- nearly a full megabyte. (This is
actually more what I was expecting - I'm impressed by the Times.)

Suppose someone desperately wants to glance at the Wall Street Journal from a
poor connection where they barely get data. The difference between 12 KB and
nearly a megabyte is huge. It's the difference between 4 seconds and 312
seconds: 4 seconds as compared with 5 full minutes.
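
The arithmetic behind those figures can be checked with a short sketch. It assumes a constant link speed of 3 KB/s, which is not stated above but is what 12 KB in 4 seconds implies; `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(size_kb, speed_kb_per_s):
    """Time to download `size_kb` kilobytes at a constant speed."""
    return size_kb / speed_kb_per_s


# Assumed link speed, inferred from "12 KB in 4 seconds".
SPEED_KB_PER_S = 12 / 4

text_only = transfer_seconds(12, SPEED_KB_PER_S)
full_page = transfer_seconds(938.15, SPEED_KB_PER_S)

print(f"text only: {text_only:.0f} s")
print(f"full page: {full_page:.0f} s (about {full_page / 60:.1f} minutes)")
```

At that assumed speed the full page takes a little over 312 seconds, so the "5 full minutes" figure holds.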

So there is a large need, in my opinion, for such a service in case someone
desperately wants to see a text render. Preserving any formatting at all
helps hugely.

~~~
cozzyd
You can use the SHA-1 of the PDFs to avoid serving the same PDF twice.

~~~
vgb2k11
You've been away from HN a few days? The SHA-1 collision example uses PDFs in
its demo. Hence the other commenter suggesting SHA-256.
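
For what it's worth, deduplicating by content hash is only a few lines with SHA-256. This is a minimal sketch, not anything the linked project does; `ConversionCache` and the `convert` callable it wraps are hypothetical stand-ins for whatever turns PDF bytes into text.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's bytes, used as a cache key."""
    return hashlib.sha256(data).hexdigest()


class ConversionCache:
    """Remember converted text by content hash so identical PDFs convert once."""

    def __init__(self, convert):
        self._convert = convert  # hypothetical PDF-to-text function
        self._results = {}

    def get_text(self, pdf_bytes: bytes) -> str:
        key = content_key(pdf_bytes)
        if key not in self._results:
            self._results[key] = self._convert(pdf_bytes)
        return self._results[key]
```

The cache lives in memory here; a real service would presumably persist the results keyed the same way.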

~~~
jhlgkhkhil
The OP was almost certainly being sarcastic.

~~~
logicallee
For clarity can you edit your comment to add cozzyd (the OP you mention) - I
am sometimes sarcastic but not in this case. I'll then delete this comment.

------
marak830
Ahh this will be useful for my kitchen receipts. Thanks. Now I just need to
roll that with an auto translator too. (I guess I have my day-off project now
:-) )

~~~
jlink
Happy to know it could help you. Good cooking to you!

~~~
marak830
Both I and my accountant thank you haha.

~~~
jlink
Glad to hear that!

------
agumonkey
Fun, I did the same thing as a Clojure REPL exploration to pipe PDF text to a
bare Swing GUI (I know, a little absurd in a way).

The deja vu made me squint for a minute.

PS: PDFBox is nice

------
robinhowlett
Nice. I recently got very familiar with PDFBox and parsing complex layouts -
it is a great library.

------
Animats
Is there a PDF to HTML converter which can consistently get line breaks right?

~~~
jlink
Could be a nice feature, but not an easy task. I'll give it a try, though.

~~~
ganwar
Please update us/me when you do. I'm also working on the same problem, would
love to chat.

------
curiousgal
Although I haven't tested this yet, these utilities tend to fail when fed a
table with empty cells.

~~~
gpvos
The first example image in the linked article shows a conversion from a table
with some empty cells. It looks fine.

~~~
curiousgal
Those are at the end. I meant empty cells in the middle. The ones I tried
don't account for them.

~~~
jlink
It also works with empty cells in the middle.

------
kzrdude
But does it keep both the layout and SHA-1 hash? Not sure it's HN worthy
otherwise.

