Show HN: PDFLayoutTextStripper – Converts PDF to text while keeping the layout (github.com/jonathanlink)
283 points by jlink on Feb 25, 2017 | 92 comments

For those interested in converting PDF tables into CSV, there's also Tabula ( http://tabula.technology/ )

(Used by many journalists to analyze the data in PDFs)

I find it absolutely ridiculous that we have to resort to these kinds of tools :/

We have digital formats, and we decided to standardize document distribution on the one that makes it as hard to extract data as if it were on physical paper.

PDF is a perfectly fine and rich digital format. It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Sure, PDF is light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).

Either there is very little real demand or we consistently fail at making alternatives viable.

> It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Technically it does that, yes. Rarely do I see people taking advantage of it, though; most of the times I tried to copy some text out of PDF, the result had to undergo a significant cleanup before becoming usable.

I've had similar experiences, as well as worse; one PDF I got from a bank about my student loans a year or so back had ostensibly only text content, but none of it was even able to be selected.

PDF is not very fine. Copy-paste from PDF very often results in complete rubbish, even when it is not deliberately prevented (which the format allows, and then you have to do OCR).

People purposefully disallowing copy-paste isn't a problem with PDF: in other formats they would have embedded a picture, at least with PDF you get the other advantages of proper text: infinite zoom and great compression. Sadly there's also a lot of PDFs that are little more than a picture collection that looks like text, but that's hardly the file format's fault.

It really is a problem with PDF that it's too easy to get a file where copy and paste yields a different result than what's displayed. But this varies widely with the software used for creating the file (e.g. LaTeX ligatures never work in copy-paste).

When the PDF is a picture collection that looks like text, that's when PDF is being used correctly, because that's when something was scanned from paper and put into a paper-like format for computers: PDF.

When people write text, data, and tables on the computer and then put them into a paper-like format to share, that's when the problem happens.

Have you ever actually tried to parse PDF with software? It's a sheer nightmare. PDF often gets produced from text processors that have very rich format information. PDF strips it all out and then you somehow have to recreate it.

PDF has no paragraphs, often not even words. No concept of footnotes. It doesn't reflow well across different screen sizes.

If you designed a real bad format on purpose, it would be hard to top PDF. Maybe Photoshop files are worse.

Have you seen the spec for .doc and .xls ?

I don't even want to know :)

Oh, you really do! The format for COM object based documents like XLS and DOC is actually a FAT filesystem: https://en.wikipedia.org/wiki/Compound_File_Binary_Format

HTML is a viable alternative. And it is something everyone can parse easily, better yet if the data is tagged with classes somehow.

That's something that should be pushed by the developer community, I think. Perhaps having an HTML client for people who nowadays use PDF writers and readers, with the option to tag data in some easily parseable format (if the data isn't already coming in a table).

This should output a single file, and ideally it should have some way of assuring the author that it won't be modified unnoticed (that's one of the features common people use PDF for today; they think it is something no one can modify) -- perhaps signing it with a key from Keybase would work in the mid term.

https://github.com/iffy/lhtml has something going in this direction.

epub is HTML-based, and its standards body recently got absorbed by the W3C. I think it would be a fine replacement for some of the uses that PDF gets (such as distributing research papers). Unfortunately I don't see it happening any time soon: PDF is so ubiquitous right now and there are very few tools that let you "save to epub". Chicken & egg.

That's Adobe. Look at their other formats, and PDF seems to be one of their better ones. Compare to SWF, PSD, AI and so on.

PDF is the successor of PostScript. PostScript is a stack-based programming language where anything can happen, while PDF enforces some document structure and metadata structure on top of it, so you can e.g. at least determine where pagebreaks are, without having to interpret ("run the code of") the whole document.

Still, PDF is simpler than PostScript in the same sense that XML is a simplification of SGML. Jumping from PDF to a well-designed format would be like jumping from XML to JSON or S-Expr.

It's because PDFs have no concept of lines or paragraphs. It's just characters at x,y coordinates which happen to line up. So figuring out what's a line or a column is a pain in the ass.
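
To make the point concrete, here is a minimal Python sketch of the line-reconstruction problem. It assumes the glyphs have already been extracted as hypothetical (x, y, char) tuples, that y grows downward, and that every character is the same width. The `Glyph` type, the function names, the tolerance, and the character width are all illustrative assumptions, not the API of PDFBox, Tabula, or any other real library.

```python
from typing import Dict, List, NamedTuple


class Glyph(NamedTuple):
    """Hypothetical extracted glyph: a position in points plus its character."""
    x: float
    y: float
    char: str


def group_into_lines(glyphs: List[Glyph], y_tolerance: float = 2.0) -> List[List[Glyph]]:
    """Cluster glyphs whose y is within y_tolerance of the first glyph of a line."""
    lines: List[List[Glyph]] = []
    # Assumption: y grows downward, so sorting by y yields top-to-bottom order.
    for glyph in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(lines[-1][0].y - glyph.y) <= y_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    return lines


def render_line(line: List[Glyph], char_width: float = 6.0) -> str:
    """Snap each glyph to a character grid so horizontal gaps become spaces."""
    cells: Dict[int, str] = {}
    for glyph in line:
        cells[round(glyph.x / char_width)] = glyph.char
    return "".join(cells.get(col, " ") for col in range(max(cells) + 1))


def layout_text(glyphs: List[Glyph]) -> str:
    """Rebuild plain text, one output line per cluster of vertically aligned glyphs."""
    return "\n".join(render_line(line) for line in group_into_lines(glyphs))
```

Even this toy version needs two guessed constants (the vertical tolerance and the character width). Real documents add proportional fonts, rotated text, and multiple columns, which is where the pain comes from.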

That's most likely why copying and pasting sucks too.

Yes, and more when you want to send a PDF based document to a Kindle.

I had to use Tabula to extract a decade of SAT scores from PDFs for each state/year. It was a nightmare, but I managed it. More recently, I was hoping to do something similar with decennial census data, but it was just too much. Far, far too many groups publish data to PDF, which is about as bad as if they just deleted it straight-out. It's very upsetting.

PDF is fucked up beyond all doubt. But there seems to be no better (even if unpopular) alternative.

How do you imagine a better alternative to PDF? On the one hand, we have text-based formats. They are not a serialization of the exact rendering. On the other hand, we have ps, which is, probably, too complex to be manipulated as text when rendered. PDF and DjVu kinda do both, even if quite imperfectly.

So how do we construct a file format which can render a symbol (not necessarily a Unicode one) anywhere, pixel-perfect, but still has a concept of words, paragraphs, and preferably tables and such?

epub is the way to go I think. PDF is an overengineered abomination. It nicely serves the purpose of "there is only one and exactly one way to render this", but then again, just about so does an image.

I also think that, with more love, epub could get there too. It's not an easy problem, but if we can crack SHA-1 I'm sure we can crack this one too :)

I don't see how epub can be pixel-perfect. It's almost as much a markup format as fb2. Clearly more explanation of how it should be done is in order.

Pixel perfection is not necessary for 99.999% of the cases PDF is used in.

That's just a ridiculous statement.

Microsoft XPS?

This is interesting. I never considered this one. How is it inferior to PDF, so that it is so much less widely spread?

It's not as versatile (no forms, for example), but layout- and prepress-wise it seems to be as good as PDF (with the benefit that it retains the structure).

Tabula is a great tool. In my experience it's the most reliable open source software for extracting tables from PDFs. We are using their underlying Tabula-Java library for some parts of https://docparser.com and are happily sponsoring their project.

I didn't know about Tabula and gave it a try right away. Apparently it only extracts tables and ignores everything around them. This might be good in some cases, but it is a problem if you want to extract a form, a whole textbook, your bank statements, or anything else. Also, I noticed that Tabula has some slight trouble when column borders are not drawn in the table. But overall it is a good tool for extracting only tables, that's true.

Tabula is a nice free tool but requires some technical background to run. There is also the free https://pdf.co with both online and offline tools (Windows) for PDF to CSV. (disclaimer: I work on it)

Hi there. We try to make a tool that's as simple to use as possible (given the constraints of a volunteer-run project such as Tabula). What technical background do you think is required to use it? (disclaimer: I'm the main author of Tabula)

Hi, and thank you for your work on Tabula! Well, some months ago I advised someone to try Tabula, and the first thing that happened was that the Java download page opened without any explanation. She managed to install the Java runtime and try again, but when she tried to upload files it displayed either an internal server error with a JRuby message or just plain JSON in the browser. So, in my opinion and experience, it may require some effort to run it (at least for the first time). But to _use_ it, for sure, no such technical background is required.

This is important for (al)pine users ... when reading email in a terminal it is very useful to be able to open a PDF attachment as text and view it in the (terminal) mailtool ...

Yes, (al)pine is my mailtool in 2017.

Also mutt, which I've switched back to recently. I've got a little Atom-powered Chromebook converted to Linux that just does not like modern heavy webmail clients (even Gmail when it was still running ChromeOS, and this is one still on the market, the Acer CB3-131), so a combination of mutt, mbsync, and msmtp is a much nicer combo. Mutt is a terrific mail reader but its internal SMTP and IMAP handling can be a bit iffy, hence mbsync and msmtp.

Though I can generally open attachments just fine, this text rendering of PDFs would be useful for when I'm SSH'd into my home machine and reading stuff remotely (usually from work, where I don't want to download my personal email).

Mutt + mbsync + msmtp is my setup too. I'm using Neomutt, since that's being actively maintained by a sizeable community of friendly people.

I use alpine in 2017.

It's easy to use, pluggable, and faster than any GUI I've touched.

As do I, because it's faster than browser-based email and many GUI clients (like Thunderbird). I also like the fact that I can just copy my .pinerc file across to a new computer and my mail client is set up.

I had not considered the PDF issue. I just open them with an external application. The potential of reading them within Alpine hadn't occurred to me, but now it has, I want it!

Something like:

    application/pdf; pdftotext %s /tmp/pdftxt && less /tmp/pdftxt
In your MAILCAP file?

No need to feel ashamed. I set up my own email server in 2016 and use mutt, squirrelmail, and iOS Mail very frequently.

Curious if this works better than the pdftotext utility that comes in the Debian poppler-utils package.

That has a -layout option that works really well sometimes and really terribly other times. It doesn't seem to be related to document complexity either.

I had used the xpdf [1] package, a C library and a set of CLI tools (mentioned by others in this thread too, and which the pdftotext command-line utility and xpdf/pdftotext library are a part of), in a consulting project for a client some years ago. (The client had asked me to evaluate some libraries for PDF text extraction and then recommend one, which I did (I chose xpdf), and I then consulted for them on their product, using xpdf for part of the work. I also did some post-processing of the extracted text in Python. Interesting project, overall.)

As part of this work, I communicated over a period, with one of the key technical people at the company behind xpdf, Glyph and Cog. Got to know from him about some of the issues with text extraction from PDF, one of the key points being that in some or many cases, the extraction can be imperfect or incomplete, due to factors inherent in the PDF format itself, and its differences from text format. PDFTextStream (for Java) is another one which I had heard of, from someone I know personally, who said it was quite good. But those inherent issues of text extraction do exist.

So wherever possible, a good option is to go to the source from which the PDF was originally generated, instead of trying to reverse-engineer it, and get the text you want from there. Not always possible, of course, but a preferred approach, particularly for cases where maximum accuracy of text extraction is desired.

[1] Not to be confused with xtopdf, my PDF toolkit for PDF generation from other formats.

During development I compared my results with those of the pdftotext utility and obtained more or less similar results. The objective of my code was to have an equivalent tool easily embeddable in any Java/Android project and to learn more about Apache PDFBox.

I imagine it's not an easy task guessing about proportionally spaced fonts, overlapping bounding boxes, columns, tables, wrapping, and so forth.

yes, definitely not easy but fortunately pdfbox offers a solid base to start with.

It probably works reasonably well with the documents it has been tested with. It's a very hard problem to crack if you ask me. (edit: word choice)

Also available for windows and mac at http://www.foolabs.com/xpdf/download.html

Last year, my boss gave me a task that looked simple enough at first glance - get data on how many vacation days each employee has in total, how many they have used in the current year, and how many they have left, and put that data in our SharePoint server (so people can see when filling out a vacation request if they actually have enough days left).

Most of that was fairly easy, except that the POS program that sits on the actual data only allows exporting data in one single format - PDF. Converting that PDF file to a CSV that I can feed into SharePoint was one of the nastiest things I did last year. I did manage to get it to work though, by toying around with pdftotext for a while and exploring its command line parameters.
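
For readers facing the same chore, a rough Python sketch of the second half of such a pipeline, turning layout-preserved text into CSV, might look like this. It is not the commenter's actual script. It assumes the text was already produced by something like `pdftotext -layout`, that columns are separated by at least two spaces, and that no cell is empty; the function name and the sample data in the usage note are made up for illustration.

```python
import csv
import io
import re


def layout_text_to_csv(text: str) -> str:
    """Split each non-blank line on runs of 2+ whitespace characters and emit CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        # A single space stays inside a cell ("Jane Doe"); wider gaps split cells.
        writer.writerow(re.split(r"\s{2,}", line.strip()))
    return out.getvalue()
```

Feeding it a report whose rows look like `Jane Doe         30    12    18` yields rows like `Jane Doe,30,12,18`. Page headers, footers, and wrapped cells would still need handling by hand.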

It was a pleasure to use! It took me a while to discover the correct set of command line parameters I needed, but I got it to work! Thanks, xpdf!

Had several somewhat similar experiences in my career. I think the general public would be surprised at the amount of duct tape and chewing gum that's behind things that appear to be important processes.

pdftotext from xpdf (http://www.foolabs.com/xpdf/download.html) also has the -table option which usually works better than -layout. Unfortunately the poppler-utils fork doesn't have this option.

Fairly frequently, OCR engines are posted here. But almost without exception, they lack layout analysis, which renders them largely useless.

Is this something that could be combined with those OCR engines? (e.g. TesseractOCR...)

I would not call these services useless ;) - but I wonder the same... Some APIs like https://ocr.space return the coordinates of each converted word. Can that be used as input? (I have not tried it yet)

ephesoft seems to use this for classifying and data extraction from documents.

some services allow you to set the layout manually: Docparser

PDF.co offline tool (for Windows) supports OCR and partial OCR for pdf to text and pdf to csv with layout preserved. (disclaimer: i work on it)

Who would be interested in a website doing the job?

if you really want to rake it in, serve, at static speeds (meaning instantly, I swear, boot a ramdrive (Tmpfs) and serve static html from nginx all from RAM), text versions of the top 10,000 web sites. there is so much crap on most sites. re-crawl hourly.

monetize via Google adwords.

EDIT: I'm not sure why I'm being downvoted. I am not suggesting serving PDFs. I am suggesting serving tiny text renders of top sites that are otherwise much too bloated.

the hard part is getting the text and layout right. many people read many sites for the text IMO.

So I am suggesting you make an all-text version.

As an example, the front page of the New York Times right now, copied into Microsoft Word, is 2504 words. When I save what I copied into Word as .txt, I get a 16.4 KB file.

By comparison, when I put the site into a Page Size Checker -- http://smallseotools.com/website-page-size-checker/ -- I get 214.23 KB. That is impressively small, and it's a fast page.

If I try their competition, the Washington Post, I get 237 KB. If I try the Wall Street Journal, I get 938.15 KB -- nearly a full Megabyte. (This is actually more what I was expecting - I'm impressed by the Times.)

Suppose someone desperately wants to glance at the Wall Street Journal from a poor connection where they barely get data. The difference between 12 KB and nearly a megabyte is huge. It's the difference between 4 seconds and 312 seconds: 4 seconds as compared with 5 full minutes.
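
Those figures imply a link speed of about 3 KB/s (12 KB in 4 seconds). The arithmetic can be checked with a small Python sketch; the 3 KB/s rate is inferred from the numbers in the comment itself, not measured, and the function name is just illustrative.

```python
def transfer_seconds(size_kb: float, rate_kb_per_s: float) -> float:
    """Time to move size_kb kilobytes over a link running at rate_kb_per_s."""
    return size_kb / rate_kb_per_s


# Rate implied by the comment: a 12 KB text render arrives in 4 seconds.
RATE_KB_PER_S = 12 / 4

text_render_s = transfer_seconds(12, RATE_KB_PER_S)     # the text-only version
full_page_s = transfer_seconds(938.15, RATE_KB_PER_S)   # the WSJ page size quoted earlier

print(f"text render: {text_render_s:.0f} s")
print(f"full page:   {full_page_s:.0f} s ({full_page_s / 60:.1f} min)")
```

At that rate the full page takes a little over 312 seconds, so roughly 78 times longer than the text render.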

So there is a large need in my opinion for such a service in case someone desperately wants to see a text render. Preserving any formatting at all, helps hugely.

You can use the SHA-1 of the PDFs to avoid serving the same PDF twice.

SHA-256 ;)

You've been away from HN a few days? The SHA-1 collision example uses PDFs in its demo. Hence the other commenter saying SHA-256.
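
Joking aside, the deduplication idea is simple to sketch. The following Python example keys an in-memory cache on the SHA-256 digest of the uploaded bytes, so an identical file is only converted once. The `ConversionCache` class and the `convert` callback are hypothetical names for illustration; a real service would presumably need persistent storage and eviction.

```python
import hashlib
from typing import Callable, Dict


class ConversionCache:
    """Hypothetical in-memory cache keyed by the SHA-256 of the file contents."""

    def __init__(self, convert: Callable[[bytes], str]) -> None:
        self._convert = convert              # the expensive PDF-to-text step
        self._results: Dict[str, str] = {}
        self.conversions = 0                 # how many times convert actually ran

    def get_text(self, pdf_bytes: bytes) -> str:
        digest = hashlib.sha256(pdf_bytes).hexdigest()
        if digest not in self._results:
            self._results[digest] = self._convert(pdf_bytes)
            self.conversions += 1
        return self._results[digest]
```

Hashing the bytes rather than the file name means renamed copies are still recognized, while any single changed byte yields a new entry.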

The OP was almost certainly being sarcastic.

For clarity can you edit your comment to add cozzyd (the OP you mention) - I am sometimes sarcastic but not in this case. I'll then delete this comment.

Sounds cool and all but two huge problems:

1) Copyright: completely re-serving the complete content of the top 100 sites with your own ads does not fall under fair use and would almost certainly be a magnet for lawsuits.

2) Distribution: how do you find your niche of people with poor internet connections and get them to use your mirror instead of whatever site it is they want to read?

no clue on 2. for 1, you could have it be "opera mini/turbo as a service" so that you are arguing you are just shifting the viewer to the site, but it's still the user doing the viewing. it helps if you preserve any text ads on the site (or links, with alt-text, given you're probably not doing images. you could also replace images with a grainy black-and-white very low-fidelity version, this also shifts most ads on the original site, without adding hugely to your footprint.) To be honest I also thought perhaps javascript etc could be run, so that the heaviest sites of all are still downloaded and then turned into text versions. In many cases that can let someone browse a site that is otherwise incredibly slow.

This isn't legal advice, just the approach I would use off of the top of my head. I agree with you that it's hard. with the framework "opera minifier/turbofier as a service" it could work, though. Like a remote browser. (in a VM). Like, present it as "lynx as a service." (Lynx being an old terminal-based text browser.) Something like that, anyway.

Doesn't Opera Mini or Turbo already provide this service? Perhaps add PPMD proxy text compression with an English dictionary, with a JavaScript browser plugin on top of that. You can't get more efficient than that.

Maybe, but asking someone to use a new browser is asking a lot. If you like, you can think of this as Opera minifier/turbofier as a service.

For what it's worth, here's a service that does that https://documentalchemy.com/demo/pdf2txt (and more: https://documentalchemy.com/demo)

Just tried the demos on this website.

I tried to extract text from a pdf that already has searchable text, which can be copy-pasted. This should be the easiest task of all but it made mistakes in every second word.

Then I asked the website to make a pdf into a word-file. It just inserted the whole pdf as a picture in word.

> Then I asked the website to make a pdf into a word-file. It just inserted the whole pdf as a picture in word.

Really? I'm pretty sure that's not the way this works.

thanks for sharing this one, didn't know it.

Yeah, sure, a public one for non-privacy-critical PDFs plus something like a Heroku button to build your own secure app (with auth and no storage).

See e.g. my file sharing app https://github.com/andreif/SecretFile

I bet most would, but privacy would be a big concern for me at least. A script is the optimal format for me.

Privacy is the reason I'd prefer to do this in house.

Not to mention corporate privacy/IP.

Indeed. I might also show this to my wife (she works at Nissan); it might save her a bit of time.

check docparser.com

Interesting service, which did not exist yet back in 2015 when I wrote my class.

Correct! We launched July 2016

Ahh, this will be useful for my kitchen receipts. Thanks. Now I just need to combine that with an auto translator too. (I guess I have my day-off project now :-) )

Happy to know it could help you. Good cooking to you!

Both I and my accountant thank you haha.

glad to hear that!

Fun, I did the same thing as a clojure repl exploration to pipe PDF text to a bare Swing GUI (I know, a little absurd in a way).

The deja vu made me squint for a minute.

ps: pdfbox is nice

Nice. I recently got very familiar with PDFBox and parsing complex layouts - it is a great library.

Is there a PDF to HTML converter which can consistently get line breaks right?

Could be a nice feature, but not an easy task. I'll give it a try, though.

Please update us/me when you do. I'm also working on the same problem, would love to chat.

Although I haven't tested this yet, these utilities tend to fail when fed a table with empty cells.

The first example image in the linked article shows a conversion from a table with some empty cells. It looks fine.

Those are at the end. I meant empty cells in the middle. The ones I tried don't account for them.

It also works with empty cells in the middle.
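
For anyone wondering why empty cells in the middle are the harder case, here is a small Python sketch contrasting two ways of splitting a layout-preserved row. It is only an illustration and makes no claim about how PDFLayoutTextStripper, Tabula, or pdftotext work internally. It assumes left-aligned columns whose start offsets can be read off the header row; all function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, Optional[int]]


def split_on_gaps(line: str) -> List[str]:
    """Naive approach: split on runs of 2+ spaces. Empty middle cells vanish."""
    return re.split(r"\s{2,}", line.strip())


def column_spans(header: str) -> List[Span]:
    """Derive (start, end) character spans from where each header word begins."""
    starts = [match.start() for match in re.finditer(r"\S+", header)]
    ends: List[Optional[int]] = [*starts[1:], None]  # last column runs to end of line
    return list(zip(starts, ends))


def split_on_columns(line: str, spans: List[Span]) -> List[str]:
    """Slice the row at fixed offsets so an empty cell stays an empty string."""
    return [line[start:end].strip() for start, end in spans]
```

With a header of four columns and a row whose third cell is blank, `split_on_gaps` returns three values and silently shifts the last one left, while `split_on_columns` returns four values with an empty string in the third position. Trailing empty cells are harmless to the naive split, which matches the observation that tables with blanks only at the end look fine.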

But does it keep both the layout and the SHA-1 hash? Not sure it's HN-worthy otherwise.
