A Python Library to extract tabular data from PDFs (socialcops.com)
501 points by leenasoni99 59 days ago | 99 comments



Cool! That's a good intro too.

Many people don't realise the general weird disconnect in PDFs between real content and what you see on the screen that makes it hard to recover source data. In extreme cases you have subset fonts with glyphs ordered completely differently from how they are in the original and no mapping back to the character they represent. Then the graphics stream is instructions to draw glyphs at coordinates. As you can imagine it's quite a battle to get back to something "raw" (assuming you even had fonts to start with).


If PDF is just "characters at coordinates" then getting the data out seems to require all functions of an OCR engine outside of character recognition per se (namely, layout detection). And with botched fonts, you essentially need the full package.

I so much want to see the day when PDF is dead like Flash.


The great thing about PDF, even the reason for its existence and adoption, is that a (valid) PDF file will look exactly the same — the same characters in the same fonts at exactly the same positions on every page — on any printer or display across the world, and across time. With HTML (what you propose) it is hard even to get something to look the same at different browser window widths, let alone different devices or different versions of browsers.

Sometimes that's what you want (and when the visual appearance is not important, it may make sense to not use PDF), but I definitely wouldn't want to see PDF “dead”.


Yeah, except the devices are different, so there's no point in trying to use the same page size on them, you only torture the reader.

> With HTML (what you propose) it is hard even to get something to look the same at different browser window widths, let alone different devices or different versions of browsers.

Only if you're trying to use some fancy layout, or if your idea of ‘the same’ is literal. Use a simple ‘text, image, text’ layout like it's the days of HTML 2.0, but with better formatting—and you'll have zero problems reformatting for different displays or reflowing the document into columns. Notice how all popular content sites adopted this layout in the main content column of their pages—and the pages work nicely on both desktop and mobile devices, and are captured fine with Pocket, Evernote and the like.

If you're trying to use a fancy layout for a paper-like publication, the question is why the hell you're doing that.


> Only if you're trying to use some fancy layout, or if your idea of ‘the same’ is literal.

Actually, the plainest CSS-free HTML renders inexplicably small on some modern flagship phones unless you add the proprietary viewport meta tag, which is in the process of becoming CSS: https://www.w3.org/TR/css-device-adapt-1/


Yes, my idea of “the same” is “the same”.

I understand what I believe to be your actual point: it would be nice if documents were more often published in a format that doesn't completely fix their layout and visual appearance. And I agree with that! When I'm reading something purely for its information, and don't care too much about the appearance, I too would like it if it weren't in a visually-fixed format. (That's what I said in the first comment too: “when the visual appearance is not important, […] not use PDF”.)

But my point is that for the goal of completely fixing the visual appearance, PDF is a pretty decent format (better than say, photographic images of the page), which is why it exists.

When you say you want “the day when PDF is dead”, it appears as though you cannot imagine anyone wanting that goal.

Here are two examples:

Suppose you are an author of books (a physical artefact that will inhabit libraries for centuries; forget about digital displays and all that nonsense for a moment) and care about their typographic quality. Then you will want to make sure of things like:

• that each paragraph contains appropriate line-breaks (http://eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf), so that the page as a whole has a good “texture” or “greyness” (or “colour”),

• that the words have hyphenation (to make the aforementioned good line-breaks possible), but not any poor hyphenation (https://tug.org/docs/liang/),

• that the typefaces chosen are in harmony with each other, that the paper size leads to a good “form factor” for your book, and is appropriate given the kind of binding used, etc.

• and finally, that after you have carefully proofread and verified every line of every page, the reader will not see something totally different, with lines of different widths broken in different places, etc.

Or if you cannot relate to that example, then forget all that, because it's just a special case of a simpler, more general case: suppose you know that your document is ultimately going to be read on paper, and you'd like to make sure it can look the same ten years from now as it does today.

Then PDF (especially PDF/A) is a decent format for this case.

(PS: I've seen very few websites that have good typography in the sense that when printed they approach anything like the quality of a halfway decent book.)


PDF (especially full spec'd PDF) seems bad, but I never like reading epub or other formats. PDFs are the static builds of documents.


> all functions of an OCR engine outside of character recognition per se

Actually, depending upon how 'obfuscated' an author was attempting to be, you might need that OCR engine itself.

PDF allows for defining arbitrary mappings from byte values to font glyphs. So one could define byte value 32 (decimal, usually ASCII space) to actually map to printing, say, a capital letter Z instead. One is supposed to provide a reverse mapping table when one does this that says "a decimal 32 byte prints a capital letter Z" to allow for search and extraction purposes. But the PDF spec does not require that this reverse table be present.

So it is quite possible to randomly assign font glyphs to arbitrary byte values, and omit the reverse mapping table. The result would be that extracting data back out of that PDF results in garbage if one does not know beforehand what the mapping from byte value to glyph was.

So, if a 'bad actor' did this, one's only recourse to retrieving data would be to rasterize the PDF to a bitmap, then OCR the resulting bitmap to extract the content back out.
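A toy sketch of why that breaks extraction (pure illustration in Python, not real PDF syntax): the renderer only needs the byte-to-glyph encoding, while extractors need the reverse (ToUnicode) map, which here is simply missing.

    # The font's private encoding: byte values map to arbitrary glyphs.
    private_encoding = {0x22: "b", 0x21: "e", 0x20: "Z"}
    to_unicode = {}  # reverse map omitted by the producer

    content_bytes = bytes([0x22, 0x21, 0x20])  # renders as "beZ" on screen
    extracted = "".join(to_unicode.get(b, "?") for b in content_bytes)
    print(extracted)  # "???" -- the visible text can't be recovered without the map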


Or perform frequency analysis on the simple substitution cipher. Seriously though, we need a document format with easier-to-extract payloads. Like Office documents with stronger structure, an underlying schema, along the lines of react-json-schema-form for Word.


ODF (Open Document Format): it is not perfect, but a lot better than Microsoft's formats.


I've had this issue with LaTeX-produced docs before (sans bad actor).


I think it's probably Stockholm syndrome but I quite enjoy digging into PDFs. There are actually other ways of embedding the text where you have more information to go on. "Characters at coordinates" is a particularly rough hand to be dealt in a PDF (though it definitely happens).


I have been recently writing a library in Go to fill out AcroForms, and really, the PDF spec and syntax isn't that bad or hard to understand. Most everything is a dictionary object. I've seen waaaay worse over-abused OOP monstrosities of a code base in C# and Java.

PDF 2.0 (ISO 32000-2) has been out for a while and supposedly it has eliminated a lot of the cruft from the spec. I just wish it was open like PDF 1.7


And replaced with what?


With Latex published to HTML. I'm curious to hear what widespread use-cases of PDF aren't covered by HTML with additions like MathML. Even HTML with fallback to SVG for complex sections would be a gigantic step forward from PDF.


As somebody who's written papers for scientific conferences and journals, I can tell you that this just isn't going to work. I'm not a designer and making a LaTeX-generated PDF with few frills look decent is enough work already. Generating HTML and having to verify that it looks good on every major browser whether it's a mobile or desktop device is just completely out of the question.

Seriously, think about it -- plenty of "professional" webpages with full time designers and UX engineers look like shit on mobile or Firefox or whatever (as you'll know if you've ever read reddit/HN comments). Imagine what a shitshow HTML papers from academics who are desperately doing whatever they can to get a readable version of their paper formatted during the 20 minutes before the deadline would be.


For some reason I have to battle the same argument again and again in this thread. It's like people are blinded from having seen glamour magazines with crazy layouts.

The solution is that you don't use fancy layouts for papers. Imagine that you only have HTML 2.0. Put your text in paragraphs, put images and large formulas in separate paragraphs between those of text. Now chuck that into a ready-made styling template that applies modern typographic conventions. Voila, you have a great-looking article that can be read on a display of any size, be that today or three decades from now. It can be reformatted into columns or stringed into a horizontal ribbon, printed on paper of any size, read aloud by a text-to-speech engine, saved in apps like Pocket or Evernote.

Most popular content sites today use this layout for the main content column, and the pages can be read fine on phones or saved in apps. Markdown readmes on Github use this layout, and it's smooth sailing with them. Pages in HTML 2.0 from the 90s display just fine on modern devices, aside from the different text size.

You don't need to be a designer or make sure that your articles look fine on different devices if you stick to this simple layout and use tested styling. Pages with full-time designers have problems on mobile devices because those people try to do fancy layouts. Don't use fancy layouts for papers. I've spent zero time fixing problems with layouts in Markdown or, by the way, in posts and comments on sites like Reddit and HN, because they don't allow me to do fancy layouts—and they stay readable on phones. If authors have to spend time fixing layouts of their papers, it's because they use too complex layouts which indeed would have problems displaying on different devices.


Okay, I could be convinced that a Markdown-type format with HTML output and very restrictive formatting could work, as long as it really includes (nearly) all of the support for tables, images, and math notation that LaTeX does, and displays well in any reasonable browser/platform. But is there an implemented working compiler of such a format?

I sort of doubt it. (But if I'm wrong, please do post it.) Going back to Reddit comments again, even Reddit's version of Markdown (which only allows basic text formatting and simple tables, no images or math notation) is broken as hell in their own official mobile app (at least the iOS one). Tables are screwed up, and even bold/italic is somehow buggy. And there are (probably multiple) engineers paid full time SF salaries to work on Reddit's mobile app.

Also, there would still need to be a canonical print format for this to work with current conference/journal rules, which typically include page limits. And for good reason: nobody except the authors (least of all reviewers) want papers to be any longer than they are. (Sure, you could change to a word or character limit, but then you'd have unlimited images, which would incentivize stuffing tons of information and text into figures and using tons of those. So you'd have to bring in another requirement on total image size, or something. And you can see how this quickly gets overly complicated and you'd really rather just have a simple page limit.)


You should look at several papers written in Scribble (Racket's doc templating language); they are beautiful and not hard to author (compared to LaTeX).

Here is an example: http://www.cs.utah.edu/plt/scope-sets .

Although some of the models on semantics are pictures generated from latex :p


Thanks for the link -- this does look really nice. But I do think the fact that they resort to using LaTeX-generated images for text [1] (as you mentioned) sort of proves my point that the existing implementations of this sort of format aren't really flexible enough for a lot of scientific papers yet.

However, I admit this could be a reasonable way for even impatient and stubborn researchers to publish papers, given the right implementation. I'll withdraw my initial arrogant "I can tell you that ..." :P

[1] http://www.cs.utah.edu/plt/scope-sets/model.html#%28part._.S...


Scribble's styles might have been touched by Matthew Butterick, whose online-book “Practical Typography” may IMO be the most beautiful site on the whole web, owing to the masterful application of fonts and margins: https://practicaltypography.com/

(The list of MB's commits is not telling much, but Flatt having written the foreword to MB's other book “Beautiful Racket” is more suggestive.)


Wow, that indeed is a beautiful online book, better than any Sphinx template I've so far come across. You've got me interested in checking out Scribble.


HTML and PDF are used for different purposes.

PDF is a file format for presenting fixed-layout documents in an application-independent manner. You don't want to lose a universal standard for that.

HTML is markup for presenting documents in an application-dependent manner.


They certainly should be used for different purposes, but currently I don't see why PDF is necessary for all the papers. Why do they need fixed layout? Plenty of them are already published in both PDF and HTML, what's different about the rest? It's especially baffling in the case of computer science and programming papers when the contents are the same as in blogs.

I have dozens of PDFs in my reading queue, for which I'll probably have to buy a tablet. Why can't I read the same columns of text and pictures on my e-ink reader when I can do that with HTML? Who the hell knows.


PDF is for printing and for (well, with the exception of a few edge cases) guaranteeing a layout and display given a set of paper dimensions. HTML has the advantage of responsiveness, but the inherent problem of variable output.

When I was a professor and advising my students on creating portfolios, I told them to build websites of course. But I told them to also have a link to a one-page PDF because many organizations (not just academia) forward resumes within an organization to someone senior who eventually prints it out. And you don't want that person's first impression be whatever your website's print.css churns out.


Variable output is not a problem, it's exactly what's needed. The days of standard paper formats are over, old man: in a decade people will have documents delivered straight to their retinas, or read into their ears—but everyone will still have to scroll PDFs back and forth with no possibility of reformatting, because lots of papers are published in it.

If you want to have your document printed nicely, just prepare it for printing along with other methods of output. The best way to do it is to not use some crazy layout: have a single column with images between paragraphs, and your documents will look fine on any device. All problems of reformatting documents stem from the rigid two-dimensional layout mentality, while the flexible approach requires stepping back to the one-dimensional semantic flow.

(Actually, standard paper formats were never around, because—surprise—my country doesn't use US paper formats.)


No, variable output is not "exactly what's needed". Layout of information is an actual skill -- whether it's a resume, a newspaper front page, or a photo gallery -- and we can expect layout to be an important design factor as long as humans have eyes that can process information in formats other than a byte stream.

HTML has been an excellent format for delivering data and information across innumerable devices and visual dimensions. That adaptability comes with tradeoffs. As others have pointed out, anyone who's browsed the Internet Archive knows how HTML, beautiful and organized in its own time, can look like slop today. Paper/PDF's tradeoff, of course, is its rigidity.


While I agree that PDFs are at times cumbersome to use, I can't think of a valid solution to replace them.

- Fixed layout seems much easier to handle than dynamic layouts. I.e. I can't recall any website that resizes the content correctly (correctly meaning I see the image within X% of scrolling of the referenced location; that doesn't just make the lines super-long). And without handling this properly, most of the arguments against PDF usage seem to go out the window.

- I don't know of any way of highlighting, annotating, drawing on an HTML page reliably over multiple devices. Sure, something can be built on but it requires special software, still.

- How do I send someone an HTML copy of a PDF as a single file? (embedded fonts, images etc)


> I can't recall any website that resizes the content correctly (correctly meaning i see the image within X% of scrolling of the referenced location; that doesn't just make the lines super-long)

I rarely have anything like that happen, so not even sure if I know the exact problem that you have in mind. As far as I can tell, it's specific to when authors put images somewhere distant to the text that mentions them in the one dimension of text flow, e.g. on the next page, or floating in a separate column from the text. The solution is, don't put images far from the text. HTML obviously requires a different approach from PDF: you don't think in terms of two-dimensional physical layout, you think in the one dimension of semantic layout. Most popular content sites are laid out that way now, and I mostly have no problem reading on the desktop or the phone.

> I don't know of any way of highlighting, annotating, drawing on an HTML page reliably over multiple devices. Sure, something can be built on but it requires special software, still.

It requires software just as PDF requires it. Such software isn't ubiquitous precisely because people don't see HTML annotation as a market. It's a typical chicken-and-egg market problem.

To annotate HTML, you abandon the two-dimensional graphical approach just as you do it when producing the document. Instead, you highlight text in paragraphs and attach annotations and drawings to it, independently of the current rendering of the document. Any word processor allows you to highlight text in lines and paragraphs, you do the same thing here. Evernote's web clipper highlights HTML just fine.

> How do i send someone an HTML copy of a PDF as a single file? (embedded fonts, images etc)

You use a format that packs HTML with images, styling and fonts—e.g. MAFF. Come on, it's not rocket science to store what the server sends to the browser. Again, Evernote stores pages fine and could be used for sharing (if the program didn't go to crap overall). It's the same chicken-and-egg problem.


> I rarely have anything like that happen, so not even sure if I know the exact problem that you have in mind.

Not even trying to be funny but do you mind sharing some websites that dynamically resize content correctly? I just checked a couple of the usual suspects (reuters, nytimes, guardian, github) and none do it. They are all using (semi-)fixed layouts.


I have the opposite problem of finding a page that could be problematic. Flipped through several articles on those sites, and they all use the linear article layout.

Remember that we're talking about publication of static papers, so you look at the main content column on a page, since that's what should be there in a paper. In the main column, those sites use the simple linear flow: ‘text, image, text’—with images occupying entire paragraphs instead of floating to the sides. With this layout, you can reformat articles every which way, string them into horizontal pages, render them in columns or read them with text-to-speech, etc. It's essentially HTML 2.0 layout but with better formatting. Markdown readmes on Github are the perfect example of this approach.

I've regularly used Evernote for capturing web pages, and Pocket to read them on the phone, and they have no problem with storing main content from such articles, stripped of extraneous navigation (outside of Pocket's bugs with dropping some content, presumably from overzealous anti-ad measures).

You don't look at images outside of the main content column for this discussion, because those aren't what should be there in static paper-like publications—unless the images are related to the content. And if the images are related to the content, the question is why the author is trying to use a fancy layout for such a publication.

(NYTimes do sometimes use more complex layouts in feature articles, with dynamic effects—but they, presumably, don't target those for long-term archival, and instead they customize the pages for mobile and desktop access separately. Anyway, they also should tone that down if they want readership via something like Pocket.)

I most often have problems with images on Wikipedia, because they make images float to the right side since they have many non-essential but illustrative images. Those, indeed, tend to detach from the relevant text.


Sorry, I misunderstood you. I thought you wanted to move away from PDFs because they don't resize. But none of the examples I gave resize either (neither does Pocket or Instapaper).


“Resize” is an ambiguous term, so you may or may not have understood me correctly, I'm still not sure. My (primary) problem with PDF is that it doesn't adapt to displays of different sizes—mobile, e-ink and tablet devices in addition to desktop machines—and can't be reformatted on a device (e.g. to adjust the author's typesetting choices).

Did you mean “zooming” the page in/out on the same device? That's not a big issue, in my experience: I zoom in on almost every page due to myopia, and rarely have problems. I adjust text properties on mobile devices too, namely in Pocket and e-book readers (which use HTML under the hood these days). Technically, HTML can be rendered with a rigid layout and just be zoomed in/out like a static image—it's a question of the client having this function, or, I think, can be done via a simple CSS property.

If that's still not what you had in mind, I'd like to know what you mean by “resize,” out of professional curiosity.


Fair point about font-resizing.

Since you asked out of professional curiosity: what I meant by resize is the utilization of the device's screen. If my screen allows for a 1200px wide browser window, the main content shouldn't use 800px of it. On my 5000px wide screen, nytimes.com articles seem to utilize a whopping 10-15% (I am guessing). Might as well just send me a fixed-layout PDF.

That being said, I doubt it is computationally easy to compute a good layout. Considering how slowly LaTeX compiles a PDF, trying to find the optimal layout for a non-rigid layout seems difficult within the time constraints at hand.


Oh, I happen to know a bit about this issue. It's very much not recommended to have long lines of text, as you may already know—because that way the eye has trouble finding the next line when returning from the end of the previous one, and the entire reading endeavor becomes a rather janky experience. That's one of the primary reasons that we have book pages in portrait orientation and that newspaper articles are stretched in vertical columns. With this limitation, it would be quite pointless to try “utilizing” the screen area with other elements, since they can't just be arbitrarily hanging around the text.

If you're doing a lot of reading, you would do better by having your screen in portrait orientation. Wide screens are better suited for other tasks.

I'm tempted to note, however, that HTML with a simple layout, again, can technically be hammered into displaying in several columns on a wide screen. You'd probably want/need site-specific solutions if you want to keep the site's navigation. But if you need only the main content, you could use an extension akin to the “reading mode” of Firefox/Safari/Pocket, and override the CSS to break content into columns. (There might also be such extensions around that already have columns built in.)


PDF is for long term archival.

There is no standard and widely recognized long-term archival format for HTML pages (with all the extras). Web ARChive (WARC) provides a method for bundling all the stuff in one file, but that's not enough. Plus the files will be quite large.

You just don't know how your HTML and JavaScript will render 10-15 years from now. If you look at old Web Archive files you start to see how they become crap over time.


HTML is the format. You pack it with images, CSS and whatever else, and you have the distribution format.

> Web ARChive (WARC) provides method for bundling all the stuff in file in one file, but that's not enough. Plus the files will be quite large.

Not enough how? What is there that you need besides what the server hands to you, if that's what rendered in the first place? What magical compression methods do you have in PDF that are better than ZIP compression used in MAFF, for example?

> You just don't know how your HTML and JavaScript renders 10 - 15 years from now. If you look old Web Archive files you start to see how they become crap over time.

Have a static HTML version that's rendered the same in the future. You know, the same way that you have a static PDF standard.

How do you render Javascript in PDFs in a standard way? You don't use Javascript, that's how. Javascript is not for publication of static semantic text, so you don't use Javascript for papers, it's a no-brainer.


> HTML is the format. You pack it with images, CSS and whatever else, and you have the distribution format.

HTML is not a good format and standard for that purpose. It's loose, best-effort markup with no good consensus on semantics. HTML with images is not a good option for papers which have many equations.

EPUB3 is an emerging standard for what you want, but it's not really a good, complete solution that can replace PDF/A or TeX/LaTeX.

> Have a static HTML version that's rendered the same in the future

We don't have that.


> It's loose best effort markup with no good consensus on semantics.

And PDF has good semantics? Are we still on the topic of how HTML is better than PDF, or…? We're in the comments for a page that says that PDF tables are characters just floating in space, and people are saying most PDFs out there don't have semantic markup. Meanwhile HTML has had semantics efforts for decades now; just choose your flavor.

Blind people read HTML, you know. Do they read PDFs?

> HTML with images is not good option for papers which have many equations.

There's MathML for that, and IIRC other formats too. You could even have embedded TeX like Anki has. Use SVG for fallback.

>> Have a static HTML version that's rendered the same in the future

> We don't have that.

Ooh, chicken-and-egg again? Freeze any of the versions from the past decade with the rendering standards, and you'll have it.

But actually, it doesn't even matter, just like HTML 2.0 can be rendered fine on modern devices (aside from the different text size). Treat your paper as a paper instead of a webzine, don't use crazy layouts, just do “text, image, text” which you'll want anyway for the different displays—and your document will render fine in the future when it will be delivered straight to the retina, instead of making me scroll the PDF back and forth because no reflow.


PDFs are used to build buildings from, for one. And there's a time when you want your layout to be exact :-)


"I so much want to see the day when PDF is dead like Flash. "

Totally agree. A while ago I had to write code to import a ton of PDF files and it was just infuriating to realize that we have data in highly structured documents, throw all structure away to create a PDF and then we somehow have to divine that structure back from the PDF with enormous effort and only partial success. It's just a horrible, horrible file format for what it's used now.


Adobe released a way to attach data tables to PDFs. But I think it hasn't been adopted fully, since many organizations that release open data as PDFs don't tag the accompanying data tables. https://www.w3.org/TR/WCAG20-TECHS/PDF6.html


Sounds like OCR is really the universal method. I'm guessing it shouldn't even be as hard as full-blown OCR, since you have access to fonts used, so you can render known characters as a reference and pretty much run a per-pixel matching on rendered PDF.
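A minimal sketch of that per-glyph matching idea with OpenCV template matching; pdf2image for rasterizing, the DPI, and the glyph templates (rendered from the embedded fonts by some other step) are all assumptions here, not anything the library does:

    import cv2
    import numpy as np
    from pdf2image import convert_from_path

    # Rasterize the first page to a grayscale array.
    page = np.array(convert_from_path("input.pdf", dpi=300)[0].convert("L"))

    def find_glyph(page_gray, glyph_template, threshold=0.9):
        # Normalized cross-correlation of the rendered glyph against the page.
        scores = cv2.matchTemplate(page_gray, glyph_template, cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(scores >= threshold)
        return list(zip(xs, ys))  # candidate top-left corners of matches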


The extreme cases (where there is no/incorrect mapping between the glyph and the character they represent) are a real pain! This mapping is stored as a ToUnicode map inside the PDF. In the past I've used OCR to handle such cases but I'm planning to create an experimental interface where anyone can modify the ToUnicode map. The challenge would be to make the modifications automated/user friendly.


Wicked. Sometimes the ToUnicode map is missing so we actually rebuild it ourselves from other information we find in the PDF.


This sounds neat. Thanks for the work, vortex_ape and others. When I last needed this, I used Tabula via tabula-py. Tried Camelot on the PDF [1] I worked on and unfortunately the default option returned a less workable dataframe than tabula-py. I think it's just the area detection of Stream, and you are working on it anyway, so I'm really looking forward to seeing the results.

btw, I think the pip install requirements missed opencv-python (on Windows?). And in this doc [2], it should be "top left and bottom right" instead of "left-top and right-bottom".

[1] https://www.boj.or.jp/en/statistics/set/kess/release/2018/ke...

[2] https://camelot-py.readthedocs.io/en/master/user/advanced.ht...


Hey squaresmile! Yes, right now table detection with Stream doesn't work nicely if the table is not present on the full page, for which you can use the table_area kwarg from [2].
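For reference, a minimal sketch of restricting Stream to a region of the page (the coordinate string is a placeholder, and the keyword is spelled table_areas in later releases):

    import camelot

    # "x1,y1,x2,y2" in PDF points: top-left corner, then bottom-right corner.
    tables = camelot.read_pdf("input.pdf", flavor="stream",
                              table_area=["50,700,560,300"])
    print(tables[0].df)  # parsed table as a pandas DataFrame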

You should use "pip install camelot-py[all]" to install Camelot (which will install opencv-python too). I had to take it out of the requirements since it wasn't available in any conda channels while I was creating the conda package. I'm looking to remove opencv as a requirement altogether by either vendorizing the opencv code that is being used inside Camelot or reimplementing the code using something lightweight like pillow.

Thanks for the catch in [2], I'll correct it!


Are you also working on extracting tabular data from scanned image files?


Quick suggestion - you should integrate the functions to extract signature data inside PDF. This is a huge issue and everyone has to write their own.

for example, this is my sample piece of code to extract data from Aadhaar signed PDF https://pastebin.com/dg8p98T1


Thanks for the suggestion sandGorgon! Can you also point me to an example of a PDF with signature data?


Unfortunately I cannot share without running afoul of all the laws out there, but you can create your own here - https://app.digio.in/#/authenticate


Ah sorry I forgot about posting PII data online. Thanks for the link!


This is a really good example of how to briefly introduce/sell a library. What it does, why, how, how to install it, with concrete examples.


The API and docs (which the blog post was built upon) were inspired by pandas and requests!


A few months ago I was looking for a similar solution but couldn't find one that handles empty cells very well. I ended up writing my own program[0] that is specific to my files' layout.

This library works perfectly and could've saved me a lot of time! Looking at some of the source code, we used similar logic to parse the tables. Pretty neat!

[0]: https://github.com/khllkcm/pdf2calendar


Will check out pdf2calendar!


This is nice. I do quite a bit of tabular data extraction and pdf tables are often a sticking point. It is absolutely correct in describing it as a "fuzzy" problem.

My go-to solution has been 'pdftotext -layout' with a bit of hackery before giving it to pandas.read_fwf. That usually gets me 80% of the way there 80% of the time. The upside is that this tends to fail "better" than some other options.
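Roughly that workflow, for anyone curious (the skiprows value is a placeholder you'd tune per document):

    import subprocess
    import pandas as pd

    # Dump the PDF preserving its physical layout, then parse fixed-width columns.
    subprocess.run(["pdftotext", "-layout", "input.pdf", "out.txt"], check=True)
    df = pd.read_fwf("out.txt", colspecs="infer", skiprows=5, header=None)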

I look forward to kicking-the-tires with this on my test cases.


Do submit bugs on GitHub if you face any issues! https://github.com/socialcopsdev/camelot


This is very interesting software. In the research community, many results are still only available as PDF tables in papers, so getting them into a dataframe is very useful. Good job! By the way, I would also like to export to Excel files from the command line.


Hey danimolina! You can export the data into an Excel file by specifying it as the export format. Camelot comes with a command-line interface too! https://camelot-py.readthedocs.io/en/master/user/cli.html#cl...

You can simply do: camelot --output data.xlsx --format excel lattice input.pdf (lattice can be replaced with stream based on the type of tables in your PDF)


> However, OpenCV’s Hough Line Transform returned only line equations.

Did you try HoughLinesP? https://docs.opencv.org/2.4/modules/imgproc/doc/feature_dete...

It returns line segment endpoints using a probabilistic Hough Transform. I'm fully confident your solution works, just wondering if you tried this and why it was rejected.


Hi plaidfuji! I did try HoughLinesP during experimentation. I vaguely remember (since this was almost 2 years back) getting the actual line segment as a combination of multiple smaller line segments in all cases (which could then be combined to form the actual segment using some heuristic). It came down to getting the actual table line segment out, which a combination of morphological transformations and cv2.findContours provided (without the need for another combining step).
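For the curious, a rough sketch of that morphology-plus-contours approach for pulling out horizontal rulings (kernel sizes are guesses, not Camelot's actual values):

    import cv2

    img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    # Erode then dilate with a wide, flat kernel so only long horizontal runs survive.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (img.shape[1] // 20, 1))
    horizontal = cv2.dilate(cv2.erode(binary, kernel), kernel)

    # [-2] keeps this working across OpenCV 3 and 4 return signatures.
    contours = cv2.findContours(horizontal, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    segments = [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) per ruling

The same thing with a tall, narrow kernel gives the vertical rulings.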


Interesting. I noticed you mentioned below that you're trying to get rid of OpenCV as a dependency - that's really tough. I came from a Matlab background where image processing was really well-packaged and Python is a total mess.

If you managed to vendor a small portion of OpenCV that contained image i/o, basic colorspace conversion, thresholding, scaling/rotating, shape drawing/insertion, HoughLines and findContours, I think you could release that as its own package and it would be quite popular. OpenCV is such a bloated dependency...


scikit-image contains Hough transforms and the other things you mention? Though it does depend on scipy and matplotlib, which are kinda big.


Oh that is so timely. I've been putting off that part of a pipeline I built for a while due to the complexity, and now I can just plug this in. Super neat. Thank you very much!


What does this pipeline do and what software have you used to implement it?

I have used Airflow in the past to create ETL pipelines, and plugged in Camelot in one of them to extract tables from PDFs. I also wrote a blog post about it in case you might be interested. https://hackernoon.com/how-to-create-a-workflow-in-apache-ai...
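In case it's useful, a bare-bones sketch of such a task (Airflow 1.x-style imports; the DAG id, schedule and paths are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    import camelot

    def extract_tables():
        tables = camelot.read_pdf("/data/incoming/report.pdf", flavor="lattice")
        tables.export("/data/processed/report.csv", f="csv")

    dag = DAG("pdf_table_etl", start_date=datetime(2018, 1, 1),
              schedule_interval="@daily")

    extract = PythonOperator(task_id="extract_tables",
                             python_callable=extract_tables, dag=dag)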


Compress scientific papers.

Thank you for the pointers!



Yeah. pdfplumber is also good for digital PDFs. Curious to know the advantages of Camelot over it!


> When using Stream, tables aren’t autodetected. Stream treats the whole page as a single table

I've often wondered if image semantic segmentation methods as used in the ML community could successfully identify things like "there is a table (or figure) here, it's not part of the main text". I mean, it seems that humans should be able to do this even without reading the text so I don't see why a CNN couldn't.


Yes it should work. Definitely worth trying.


So many times I have wanted to get this type of data. Visa would send reporting this way and it would have to be manually copied over. They offered CSV but there were extra charges associated. There were some pretty good libraries for paragraph text extraction but the graphs were too tough to deal with.


I hope this is good at extracting register maps from datasheets. That would save a lot of tedious driver work.


Hi Berti,

We are building a tool that can extract register map from a data sheet. You may take a look at the beta version of the tool at http://exportrm.soliton.ai/. The tool supports register map table formats similar to ones found on page 14 in https://www.nxp.com/docs/en/data-sheet/PCA9685.pdf datasheet. The table format that you have mentioned in the example is not supported at this time. We would be adding it soon. If you have any questions or would like to share any feedback, feel free to reach out to us at the email id provided on tool web page.


Hi berti! I wrote the library and the blog post. Can you point me to some PDFs which have these register maps?


Try these: Page 233 http://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-8351-M...

Page 45 https://ae-bst.resource.bosch.com/media/_tech/media/datashee...

Is the library able to handle cells that span multiple columns?


I assumed that you're talking about page 33 in the first PDF, since it has only 225 pages. I extracted Figure 6-23 from it and the table on page 45 in the second PDF. Here's a gist: https://gist.github.com/vinayak-mehta/cf30a5560f1b8ab4c0b25e...

Yes, Camelot takes care of cells spanning multiple columns! You can check out the Advanced Usage section for explanation on the keyword arguments I used in the gist! https://camelot-py.readthedocs.io/en/master/user/advanced.ht...


Note: I had to decrypt the second PDF using qpdf since the library I'm using to split a PDF into pages (PyPDF2) doesn't support the encryption type of that PDF.

Did this: qpdf --decrypt input.pdf output.pdf
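For reference, the page split itself with PyPDF2's (then-current) API looks roughly like this, run on the qpdf-decrypted file:

    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader("output.pdf")
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(44))  # zero-indexed, so this is page 45
    with open("page-45.pdf", "wb") as f:
        writer.write(f)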


Sorry! Page 203 in the first PDF has a full table of registers, and the bits in them. Thanks very much for the library, and even more for taking the time to create that notebook. The captured data looks excellent. I'll hax some code to translate this data to a header file suitable for writing a driver.


I'm curious why the authors didn't contribute this directly to Tabula instead.


Interesting. I've used Tabula [0] in the past with great success. I wonder how this compares.

[0]: https://github.com/tabulapdf/tabula


They have a detailed comparison with other tools (including Tabula) in the wiki:

https://github.com/socialcopsdev/camelot/wiki/Comparison-wit...


This just worked for me, thank you!


Great work and write up! HN submissions about PDF extraction seem to be as reliably popular as threads mentioning bees or bashing Mongo, which I guess goes to show how pervasive a problem it is.


Is there any decent tool for tabular data extraction from scanned PDFs?


Hey andrew_chris, we're working on it and will be interested to help you. Please contact me: tanmay [at] inkredo [dot] in


A Python library to do this is cool, but there's already Tabula: https://tabula.technology/


They address Tabula in the post:

> The first tool that we tried was Tabula, which has nice user and command-line interfaces, but it either worked perfectly or failed miserably. When it failed, it was difficult to tweak the settings — such as the image thresholding parameters, which influence table detection and can lead to a better output.



You should also check out pdf.js:

We use it in Polar:

https://getpolarized.io/

for our PDF management.

It's a pretty robust library and it renders everything on canvas BUT you also get the raw text in the DOM so you can play with it more as an API for managing PDFs.

REALLY nice to be able to use web standards when working with pdf.js.

The downside is that the graphics are rendered to canvas so you're only really getting an image.


Wouldn't it be easier and more generic to have an OCR solution for this task?


Hey amelius! Though OCR would provide a generic solution, it would be overkill for text-based PDFs. I'm working on getting an OCR solution up since there's still a lot of data that is trapped inside scanned PDFs and not text-based ones.

If you have any pointers in the OCR route, do suggest them here, or on this GitHub issue! https://github.com/socialcopsdev/camelot/issues/101
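(For anyone looking for a generic starting point rather than anything Camelot-specific: Tesseract via pytesseract gives word-level boxes that a layout step could then group into cells.)

    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path("scanned.pdf", dpi=300)
    # One row per detected word, with pixel coordinates.
    words = pytesseract.image_to_data(pages[0],
                                      output_type=pytesseract.Output.DATAFRAME)
    print(words[["left", "top", "width", "height", "text"]].head())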


Hey vortex_ape, we're also working on extracting data trapped inside scanned PDFs and recently, we've begun to get good results using DL algos. I am based in Gurugram, would you like to catch up and exchange experiences?


It seems that generally you'd want all functions of an OCR engine aside from the character recognition itself—namely layout detection. (And sometimes you'll need the character recognition too.)

I'd bet that commercial OCR packages that are long in the game have unified code for these functions between regular OCR and PDF processing.


OCR is less reliable than looking at the character data directly, if it's available.


Can't wait to try this out with Percollate!


Doesn't Percollate save web pages as PDFs? If you have a web page with tables, you can directly use pandas.read_html to extract them!
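A quick sketch of that path (the URL is a placeholder; read_html needs lxml or html5lib installed):

    import pandas as pd

    # read_html returns one DataFrame per <table> found on the page.
    tables = pd.read_html("https://example.com/page-with-tables.html")
    print(tables[0].head())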


I'm always skeptical of these kinds of libraries. Whenever I try to use them, it ends up feeling like a broken promise.


Ever researched why it breaks?



