Many people don't realise how deep the disconnect runs in PDFs between the real content and what you see on the screen, and how hard that makes it to recover the source data. In extreme cases you have subset fonts with glyphs ordered completely differently from the original, and no mapping back to the characters they represent. The graphics stream is then just instructions to draw glyphs at coordinates. As you can imagine, it's quite a battle to get back to something "raw" (assuming you even had the fonts to start with).
I so much want to see the day when PDF is dead like Flash.
Sometimes that's what you want (and when the visual appearance is not important, it may make sense to not use PDF), but I definitely wouldn't want to see PDF “dead”.
> With HTML (what you propose) it is hard even to get something to look the same at different browser window widths, let alone different devices or different versions of browsers.
Only if you're trying to use some fancy layout, or if your idea of ‘the same’ is literal. Use a simple ‘text, image, text’ layout like it's the days of HTML 2.0, but with better formatting—and you'll have zero problems reformatting for different displays or reflowing the document into columns. Notice how all popular content sites adopted this layout in the main content column of their pages—and the pages work nicely on both desktop and mobile devices, and are captured fine with Pocket, Evernote and the like.
If you're trying to use a fancy layout for a paper-like publication, the question is why the hell you're doing that.
Actually, the plainest CSS-free HTML renders inexplicably small on some modern flagship phones... I'm referring to the proprietary viewport meta tag, which is in the process of becoming CSS: https://www.w3.org/TR/css-device-adapt-1/
I understand what I believe to be your actual point: it would be nice if documents were more often published in a format that doesn't completely fix their layout and visual appearance. And I agree with that! When I'm reading something purely for its information, and don't care too much about the appearance, I too would like it if it weren't in a visually-fixed format. (That's what I said in the first comment too: “when the visual appearance is not important, […] not use PDF”.)
But my point is that for the goal of completely fixing the visual appearance, PDF is a pretty decent format (better than say, photographic images of the page), which is why it exists.
When you say you want “the day when PDF is dead”, it appears as though you cannot imagine anyone wanting that goal.
Here are two examples:
Suppose you are an author of books (a physical artefact that will inhabit libraries for centuries; forget about digital displays and all that nonsense for a moment) and care about their typographic quality. Then you will want to make sure of things like:
• that each paragraph contains appropriate line-breaks (http://eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf), so that the page as a whole has a good “texture” or “greyness” (or “colour”),
• that the words have hyphenation (to make the aforementioned good line-breaks possible), but not any poor hyphenation (https://tug.org/docs/liang/),
• that the typefaces chosen are in harmony with each other, that the paper size leads to a good “form factor” for your book, and is appropriate given the kind of binding used, etc.
• and finally, that after you have carefully proofread and verified every line of every page, the reader will not see something totally different, with lines of different widths broken in different places, etc.
Or if you cannot relate to that example, then forget all that, because it's just a special case of a simpler, more general case: suppose you know that your document is ultimately going to be read on paper, and you'd like to make sure it can look the same ten years from now as it does today.
Then PDF (especially PDF/A) is a decent format for this case.
(PS: I've seen very few websites that have good typography in the sense that when printed they approach anything like the quality of a halfway decent book.)
Actually, depending upon how 'obfuscated' an author was attempting to be, you might need that OCR engine itself.
PDF allows for defining arbitrary mappings from byte values to font glyphs. So one could define byte value 32 (decimal, usually ASCII space) to actually map to printing, say, a capital letter Z instead. One is supposed to provide a reverse mapping table (the ToUnicode CMap) when one does this, which says "a decimal 32 byte prints a capital letter Z", to allow for search and extraction purposes. But the PDF spec does not require that this reverse table be present.
So it is quite possible to randomly assign font glyphs to arbitrary byte values and omit the reverse mapping table. The result is that extracting data back out of that PDF yields garbage if one does not know beforehand what the mapping from byte value to glyph was.
So, if a 'bad actor' did this, one's only recourse for retrieving the data would be to rasterize the PDF to a bitmap, then OCR the resulting bitmap to extract the content back out.
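The scrambled-mapping scenario above can be simulated in a few lines of Python (the encoding table is invented for illustration; real PDFs express this through the font's /Encoding and an optional ToUnicode CMap):

```python
# A subset font assigns glyphs to arbitrary byte values.  Without the
# reverse (ToUnicode) table, an extractor can only guess the bytes are
# ASCII.  This mapping is invented for illustration.
encoding = {32: "Z", 72: "e", 101: "H", 108: "l", 111: "o"}

def draw(byte_values):
    """What a viewer rasterizes: it follows the font's encoding."""
    return "".join(encoding[b] for b in byte_values)

def extract_text(byte_values):
    """What a naive extractor recovers: it assumes the bytes are ASCII,
    because no reverse table maps them back to real characters."""
    return "".join(chr(b) for b in byte_values)

content_stream = [101, 72, 108, 108, 111]  # bytes stored in the PDF
print(draw(content_stream))          # on screen: "Hello"
print(extract_text(content_stream))  # extracted: "eHllo" -- garbage
```

The viewer and the extractor read the same bytes; only the viewer knows what they mean.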
PDF 2.0 (ISO 32000-2) has been out for a while, and supposedly it has eliminated a lot of the cruft from the spec. I just wish it were open like PDF 1.7.
Seriously, think about it -- plenty of "professional" webpages with full time designers and UX engineers look like shit on mobile or Firefox or whatever (as you'll know if you've ever read reddit/HN comments). Imagine what a shitshow HTML papers from academics who are desperately doing whatever they can to get a readable version of their paper formatted during the 20 minutes before the deadline would be.
The solution is that you don't use fancy layouts for papers. Imagine that you only have HTML 2.0. Put your text in paragraphs, and put images and large formulas in separate paragraphs between those of text. Now chuck that into a ready-made styling template that applies modern typographic conventions. Voila, you have a great-looking article that can be read on a display of any size, be that today or three decades from now. It can be reformatted into columns or strung into a horizontal ribbon, printed on paper of any size, read aloud by a text-to-speech engine, or saved in apps like Pocket or Evernote.
Most popular content sites today use this layout for the main content column, and the pages can be read fine on phones or saved in apps. Markdown readmes on Github use this layout, and it's smooth sailing with them. Pages in HTML 2.0 from the 90s display just fine on modern devices, aside from the different text size.
You don't need to be a designer or make sure that your articles look fine on different devices if you stick to this simple layout and use tested styling. Pages with full-time designers have problems on mobile devices because those people try to do fancy layouts. Don't use fancy layouts for papers. I've spent zero time fixing problems with layouts in Markdown or, by the way, in posts and comments on sites like Reddit and HN, because they don't allow me to do fancy layouts—and they stay readable on phones. If authors have to spend time fixing layouts of their papers, it's because they use too complex layouts which indeed would have problems displaying on different devices.
I sort of doubt it. (But if I'm wrong, please do post it.) Going back to Reddit comments again, even Reddit's version of Markdown (which only allows basic text formatting and simple tables, no images or math notation) is broken as hell in their own official mobile app (at least the iOS one). Tables are screwed up, and even bold/italic is somehow buggy. And there are (probably multiple) engineers paid full time SF salaries to work on Reddit's mobile app.
Also, there would still need to be a canonical print format for this to work with current conference/journal rules, which typically include page limits. And for good reason: nobody except the authors (least of all reviewers) wants papers to be any longer than they are. (Sure, you could change to a word or character limit, but then you'd have unlimited images, which would incentivize stuffing tons of information and text into figures and using tons of those. So you'd have to bring in another requirement on total image size, or something. And you can see how this quickly gets overly complicated and you'd really rather just have a simple page limit.)
Here is an example: http://www.cs.utah.edu/plt/scope-sets
Although some of the models on semantics are pictures generated from LaTeX :p
However, I admit this could be a reasonable way for even impatient and stubborn researchers to publish papers, given the right implementation. I'll withdraw my initial arrogant "I can tell you that ..." :P
(The list of MB's commits is not telling much, but Flatt having written the foreword to MB's other book “Beautiful Racket” is more suggestive.)
PDF is a file format for presenting fixed-layout documents in an application-independent manner. You don't want to lose a universal standard for that.
HTML is markup for presenting documents in an application-dependent manner.
I have dozens of PDFs in my reading queue, for which I'll probably have to buy a tablet. Why can't I read the same columns of text and pictures on my e-ink reader, when I can do that with HTML? Who the hell knows.
When I was a professor and advising my students on creating portfolios, I told them to build websites of course. But I told them to also have a link to a one-page PDF because many organizations (not just academia) forward resumes within an organization to someone senior who eventually prints it out. And you don't want that person's first impression be whatever your website's print.css churns out.
If you want to have your document printed nicely, just prepare it for printing along with other methods of output. The best way to do it is to not use some crazy layout: have a single column with images between paragraphs, and your documents will look fine on any device. All problems of reformatting documents stem from the rigid two-dimensional layout mentality, while the flexible approach requires stepping back to the one-dimensional semantic flow.
(Actually, standard paper formats were never around, because—surprise—my country doesn't use US paper formats.)
HTML has been an excellent format for delivering data and information across innumerable devices and visual dimensions. That adaptability comes with tradeoffs. As others have pointed out, anyone who's browsed the Internet Archive knows how HTML, beautiful and organized in its own time, can look like slop today. Paper/PDF's tradeoff, of course, is its rigidity.
- Fixed layout seems much easier to handle than dynamic layouts. E.g. I can't recall any website that resizes content correctly (correctly meaning I see the image within X% of scrolling of the location that references it, and the lines don't become super-long). And without handling this properly, most of the arguments against PDF usage seem to go out the window.
- I don't know of any way of highlighting, annotating, drawing on an HTML page reliably over multiple devices. Sure, something can be built on but it requires special software, still.
- How do i send someone an HTML copy of a PDF as a single file? (embedded fonts, images etc)
I rarely have anything like that happen, so not even sure if I know the exact problem that you have in mind. As far as I can tell, it's specific to when authors put images somewhere distant to the text that mentions them in the one dimension of text flow, e.g. on the next page, or floating in a separate column from the text. The solution is, don't put images far from the text. HTML obviously requires a different approach from PDF: you don't think in terms of two-dimensional physical layout, you think in the one dimension of semantic layout. Most popular content sites are laid out that way now, and I mostly have no problem reading on the desktop or the phone.
> I don't know of any way of highlighting, annotating, drawing on an HTML page reliably over multiple devices. Sure, something can be built on but it requires special software, still.
It requires software just as PDF requires it. Such software isn't ubiquitous precisely because people don't see HTML annotation as a market. It's a typical chicken-and-egg market problem.
To annotate HTML, you abandon the two-dimensional graphical approach just as you do it when producing the document. Instead, you highlight text in paragraphs and attach annotations and drawings to it, independently of the current rendering of the document. Any word processor allows you to highlight text in lines and paragraphs, you do the same thing here. Evernote's web clipper highlights HTML just fine.
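One way to sketch such layout-independent annotation (function names here are hypothetical) is to store the quoted text plus a little surrounding context, the idea behind the W3C Web Annotation model's TextQuoteSelector, so the highlight survives reflowing or restyling:

```python
# Anchor an annotation to the text itself, not to page coordinates,
# so it survives any re-rendering of the document.

def make_selector(text, start, end, context=16):
    """Record an annotation on text[start:end] without any coordinates."""
    return {
        "exact": text[start:end],
        "prefix": text[max(0, start - context):start],
        "suffix": text[end:end + context],
    }

def resolve_selector(text, sel):
    """Re-find the annotated span in a (possibly re-rendered) document."""
    needle = sel["prefix"] + sel["exact"] + sel["suffix"]
    i = text.find(needle)
    if i == -1:
        return None  # text changed too much to re-anchor
    start = i + len(sel["prefix"])
    return start, start + len(sel["exact"])

doc = "PDF allows for defining arbitrary mappings from bytes to glyphs."
start = doc.index("arbitrary")
sel = make_selector(doc, start, start + len("arbitrary"))
print(resolve_selector(doc, sel))  # the span of "arbitrary"
```

Note that nothing here depends on fonts, pages or pixel positions, which is exactly why the same highlight works across devices.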
> How do i send someone an HTML copy of a PDF as a single file? (embedded fonts, images etc)
You use a format that packs HTML with images, styling and fonts—e.g. MAFF. Come on, it's not rocket science to store what the server sends to the browser. Again, Evernote stores pages fine and could be used for sharing (if the program didn't go to crap overall). It's the same chicken-and-egg problem.
Not even trying to be funny but do you mind sharing some websites that dynamically resize content correctly? I just checked a couple of the usual suspects (reuters, nytimes, guardian, github) and none do it. They are all using (semi-)fixed layouts.
Remember that we're talking about publication of static papers, so you look at the main content column on a page, since that's what should be there in a paper. In the main column, those sites use the simple linear flow: ‘text, image, text’—with images occupying entire paragraphs instead of floating to the sides. With this layout, you can reformat articles every which way, string them into horizontal pages, render them in columns or read them with text-to-speech, etc. It's essentially HTML 2.0 layout but with better formatting. Markdown readmes on Github are the perfect example of this approach.
I've regularly used Evernote for capturing web pages, and Pocket to read them on the phone, and they have no problem with storing main content from such articles, stripped of extraneous navigation (outside of Pocket's bugs with dropping some content, presumably from overzealous anti-ad measures).
You don't look at images outside of the main content column for this discussion, because those aren't what should be there in static paper-like publications—unless the images are related to the content. And if the images are related to the content, the question is why the author is trying to use a fancy layout for such a publication.
(NYTimes do sometimes use more complex layouts in feature articles, with dynamic effects—but they, presumably, don't target those for long-term archival, and instead they customize the pages for mobile and desktop access separately. Anyway, they also should tone that down if they want readership via something like Pocket.)
I most often have problems with images on Wikipedia, because they make images float to the right side since they have many non-essential but illustrative images. Those, indeed, tend to detach from the relevant text.
Did you mean “zooming” the page in/out on the same device? That's not a big issue, in my experience: I zoom in on almost every page due to myopia, and rarely have problems. I adjust text properties on mobile devices too, namely in Pocket and e-book readers (which use HTML under the hood these days). Technically, HTML can be rendered with a rigid layout and just be zoomed in/out like a static image; it's a question of the client having this function, or, I think, it can even be done via a simple CSS property.
If that's still not what you had in mind, I'd like to know what you mean by “resize,” out of professional curiosity.
Since you ask out of professional curiosity: what i meant by resize is the utilization of the device's screen.
If my screen allows for a 1200px wide browser window, the main content shouldn't use 800px of it. On my 5000px wide screen, nytimes.com articles seem to utilize a whopping 10-15% (i am guessing). Might as well just send me a fixed-layout PDF.
That being said, I doubt it is computationally easy to compute a good layout. Considering how slowly LaTeX compiles a PDF, finding an optimal non-rigid layout seems difficult within the time constraints at hand.
If you're doing a lot of reading, you would do better by having your screen in portrait orientation. Wide screens are better suited for other tasks.
I'm tempted to note, however, that HTML with a simple layout, again, can technically be hammered into displaying in several columns on a wide screen. You'd probably want/need site-specific solutions if you want to keep the site's navigation. But if you need only the main content, you could use an extension akin to the “reading mode” of Firefox/Safari/Pocket, and override the CSS to break content into columns. (There might also be such extensions around that already have columns built in.)
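A user-stylesheet override along these lines (the selector and breakpoint are hypothetical; the column properties are standard CSS multi-column layout) would reflow a simple linear article into columns on a wide screen:

```css
/* Break the main content into columns only on wide screens */
@media (min-width: 1600px) {
  .main-content {
    column-count: 3;
    column-gap: 2em;
    column-rule: 1px solid #ccc;
  }
  .main-content img {
    max-width: 100%; /* keep images inside their column */
  }
}
```

This only works cleanly because the underlying content is a one-dimensional flow; a fancy two-dimensional layout couldn't be recolumned like this.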
There is no standard, widely recognized long-term archival format for HTML pages (with all the extras). Web ARChive (WARC) provides a method for bundling all the pieces in one file, but that's not enough. Plus the files will be quite large.
> Web ARChive (WARC) provides a method for bundling all the pieces in one file, but that's not enough. Plus the files will be quite large.
Not enough how? What is there that you need besides what the server hands to you, if that's what rendered in the first place? What magical compression methods do you have in PDF that are better than ZIP compression used in MAFF, for example?
Have a static HTML version that's rendered the same in the future. You know, the same way that you have a static PDF standard.
HTML is not a good format and standard for that purpose. It's loose, best-effort markup with no good consensus on semantics. HTML with images is not a good option for papers which have many equations.
EPUB3 is an emerging standard for what you want, but it's not really a good, complete solution that can replace PDF/A or TeX/LaTeX.
> Have a static HTML version that's rendered the same in the future
We don't have that.
And PDF has good semantics? Are we still on the topic of how HTML is better than PDF, or…? We're in the comments for a page that says that PDF tables are characters just floating in space, and people are saying most PDFs out there don't have semantic markup. Meanwhile HTML had semantics efforts for decades now, just choose your flavor.
Blind people read HTML, you know. Do they read PDFs?
> HTML with images is not good option for papers which have many equations.
There's MathML for that, and IIRC other formats too. You could even have embedded TeX like Anki has. Use SVG for fallback.
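For example, here is the quadratic formula in presentation MathML: semantic, searchable markup instead of glyphs positioned on a page (an SVG or image fallback can sit alongside it for older renderers):

```html
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>x</mi><mo>=</mo>
  <mfrac>
    <mrow>
      <mo>&#x2212;</mo><mi>b</mi><mo>&#xB1;</mo>
      <msqrt>
        <msup><mi>b</mi><mn>2</mn></msup>
        <mo>&#x2212;</mo><mn>4</mn><mi>a</mi><mi>c</mi>
      </msqrt>
    </mrow>
    <mrow><mn>2</mn><mi>a</mi></mrow>
  </mfrac>
</math>
```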
>> Have a static HTML version that's rendered the same in the future
> We don't have that.
Ooh, chicken-and-egg again? Freeze any of the versions from the past decade with the rendering standards, and you'll have it.
But actually, it doesn't even matter, just like HTML 2.0 can be rendered fine on modern devices (aside from the different text size). Treat your paper as a paper instead of a webzine, don't use crazy layouts, just do “text, image, text” which you'll want anyway for the different displays—and your document will render fine in the future when it's delivered straight to the retina, instead of making me scroll the PDF back and forth because there's no reflow.
Totally agree. A while ago I had to write code to import a ton of PDF files and it was just infuriating to realize that we have data in highly structured documents, throw all structure away to create a PDF and then we somehow have to divine that structure back from the PDF with enormous effort and only partial success. It's just a horrible, horrible file format for what it's used now.
btw, I think the pip install requirements missed opencv-python (on Windows?). And in this doc, it should be "top left and bottom right" instead of "left-top and right-bottom".
You should use "pip install camelot-py[all]" to install Camelot (which will install opencv-python too). I had to take it out of the requirements since it wasn't available in any conda channels while I was creating the conda package. I'm looking to remove opencv as a requirement altogether by either vendorizing the opencv code that is being used inside Camelot or reimplementing the code using something lightweight like pillow.
Thanks for the catch in , I'll correct it!
for example, this is my sample piece of code to extract data from Aadhaar signed PDF https://pastebin.com/dg8p98T1
This library works perfectly and could've saved me a lot of time! Looking at some of the source code, we used similar logic to parse the tables. Pretty neat!
My go-to solution has been 'pdftotext -layout' with a bit of hackery before giving it to pandas.read_fwf. That usually gets me 80% of the way there 80% of the time. The upside is that this tends to fail "better" than some other options.
I look forward to kicking the tires with this on my test cases.
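As a sketch of that workflow, assuming the sample text below stands in for real 'pdftotext -layout' output (in practice you would shell out to pdftotext first):

```python
# Parse 'pdftotext -layout' output with pandas.read_fwf.  The sample
# text is invented; a real pipeline would produce it with e.g.
# subprocess.run(["pdftotext", "-layout", "input.pdf", "-"], ...).
import io

import pandas as pd

layout_text = """\
Region        Q1        Q2
North      120.5     130.2
South       98.1     101.7
"""

# read_fwf infers the fixed-width column boundaries that -layout preserves
df = pd.read_fwf(io.StringIO(layout_text))
print(df)
```

The "bit of hackery" usually goes between those two steps: dropping headers/footers and joining rows that pdftotext split.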
You can simply do: camelot --output data.xlsx --format excel lattice input.pdf (lattice can be replaced with stream based on the type of tables in your PDF)
Did you try HoughLinesP?
Returns line segment endpoints with a probabilistic Hough Transform. I'm fully confident your solution works, just wondering if you tried this and why it was rejected.
If you managed to vendor a small portion of OpenCV that contained image i/o, basic colorspace conversion, thresholding, scaling/rotating, shape drawing/insertion, HoughLines and findContours, I think you could release that as its own package and it would be quite popular. OpenCV is such a bloated dependency...
I have used Airflow in the past to create ETL pipelines, and plugged in Camelot in one of them to extract tables from PDFs. I also wrote a blog post about it in case you might be interested. https://hackernoon.com/how-to-create-a-workflow-in-apache-ai...
Thank you for the pointers!
I've often wondered if image semantic segmentation methods as used in the ML community could successfully identify things like "there is a table (or figure) here, it's not part of the main text". I mean, it seems that humans should be able to do this even without reading the text so I don't see why a CNN couldn't.
Is the library able to handle cells that span multiple columns?
Yes, Camelot takes care of cells spanning multiple columns! You can check out the Advanced Usage section for explanation on the keyword arguments I used in the gist! https://camelot-py.readthedocs.io/en/master/user/advanced.ht...
Did this: qpdf --decrypt input.pdf output.pdf
>The first tool that we tried was Tabula, which has nice user and command-line interfaces, but it either worked perfectly or failed miserably. When it failed, it was difficult to tweak the settings — such as the image thresholding parameters, which influence table detection and can lead to a better output.
We use it in Polar for our PDF management.
It's a pretty robust library and it renders everything on canvas BUT you also get the raw text in the DOM so you can play with it more as an API for managing PDFs.
REALLY nice to be able to use web standards when working with pdf.js.
The downside is that the graphics are rendered to canvas so you're only really getting an image.
If you have any pointers in the OCR route, do suggest them here, or on this GitHub issue! https://github.com/socialcopsdev/camelot/issues/101
I'd bet that commercial OCR packages that are long in the game have unified code for these functions between regular OCR and PDF processing.