Edit: To add a little more color: given that none of us was (or at least I certainly wasn't) an expert on the PDF format, we had so far treated this as a bug of probably at-most moderate complexity (just have to read up on PDF and figure out what the base unit is, or whatever). After discovering what this article talks about, it became evident that any solution we cobbled together in the time we had left would really just be signing up for an endless stream of it-doesn't-work-quite-right bugs. So, a feature that would become a bug emitter. I remember in particular considering one of the main use cases: scientific articles, which are usually in two columns AND also use justified text. A lot of the time the spaces between words could be as large as the spaces between columns, so the statistical "grouping" of characters to try to identify the "macro rectangle" shape could get tricky without severely special-casing for this. All this being said, as the story should make clear, I put about one day of thought into this before the decision was made to avoid it for 1.0, so for all I know there are actually really good solutions to this. Even writing this now I am starting to think of fun ways to deal with this, but at the time it was one of a huge list of things that needed to get done and had been underestimated in complexity.
The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are hundreds of billions of dollars going into self-driving cars, and like zero dollars going into this problem.
PDF is the de-facto standard for invoicing, POs, quotes, etc.
If you solve the problem you can effectively deal programmatically with invoicing, payments, and large parts of ordering/dispensing. It's a no-brainer to add it on to almost any financial/procurement software that deals with inter-business stuff.
Any small-medium physical business can probably halve their financial department if you can dependably solve this issue.
As far as I understand there are at least two standards (that I know of, in Germany): XRechnung and ZUGFeRD/Factur-X (which is PDF/A-3 with embedded XML).
Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate processes, or adopting interchange formats.
The problem for some small businesses is that the cost (process changes, licensing, etc.) of adopting interchange formats and working with large vendors is prohibitive at their scale, e.g. the airline BSP system.
I agree that solving the problem generally, i.e. replacing an accounts-payable staff person capable of processing arbitrary invoice documents, will be comparable to self-driving in difficulty.
If a company deals with a lot of a single type of PDF, then the approach could be economical. I am actually involved in a project looking at doing this with AWS Textract.
Building machines that understand formats that are understood by humans is exactly what we should be doing. People should read, write, and process information in a format that is comfortable and optimized for them. Machines should bend to us, we should not bend to them.
If businesses only dealt with machine readable formats, everyone's computer would still be using the command line.
And there's real condescension in your post:
> Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate process
You're saying that businesses need to change their business to accommodate data formats, but it should be the other way around.
The proliferation of computers in business over the last 50 years is precisely because businesses can save money/expand capacity by adapting the business processes to the capabilities of the computers.
Over that time, computers have become more friendly to humans, but businesses have adapted and humans have been trained to use what computers can do.
You can go further. Invoices often contain block sections of text with important terms of the invoice, such as shipping time information, insurance, warranties, etc. To build something that works universally, you also need very good natural language processing.
This should be obvious, but the answer is because OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting it to an image and OCRing it. But if OCR ever becomes perfect, then sure.
While Abbyy is likely the best, it's also incredibly expensive. Roughly on the order of $0.01/page or maybe at best a tenth of that in high volume.
For comparison, I run a bunch of OCR servers using the open source tesseract library. The machine-time on one of the major cloud providers works out to roughly $0.01 for 100-1000 pages.
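For a sense of what that machine time is buying, here is a minimal per-page sketch (not my actual server code; it assumes the tesseract binary plus the pytesseract and Pillow packages are installed, and the page image name is made up):

    from PIL import Image
    import pytesseract

    # OCR a single pre-rendered page image and get plain text back.
    text = pytesseract.image_to_string(Image.open("page-001.png"), lang="eng")
    print(text)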
- Better search engine results
- Identifying experts within a company
- Better machine translation
- Finding accounting fraud
- Automating legal processes
For context, the reason why Facebook is the most successful social network is that they're able to turn behavioral residue into content. If you can get better at taking garbage data and repackaging it into something useful, it stands to reason that there are lots of other companies the size of Facebook that can be created.
There's almost no question in my mind that most new data will endure in some form, by virtue of being digital from day 1.
The endgame for such a company, imho, is to become the "source entity" of information management (in abstracted form), whose two major products are one to express this information in the digital space, and the other in the analog/physical space. You may imagine variations of both (e.g. AR/VR for the former).
Kinda like language in the brain is "abstract" (A) (concept = pattern of neurons firing) and then speech "translates" into a given language, like English (B) or French (C) (different sets of neurons). So from A you easily go to either B or C or D... We've observed that Deep Learning actually does that for translation (there's a "new" "hidden" language in the neural net that expresses all human languages in a generic form of sorts, i.e. "A" in the above example).
The similarities of the ontology of language, and the ontology of information in a system (e.g. business), are remarkable — and what you want is really this fundamental object A, this abstract form which then generates all possible expressions of it (among which a little subset of ~1,000 gives you human languages, a mere 300 active iirc); and you might extend that into any formal language fitting the domain, like engineering math/physics, programming code, measurements/KPIs, etc.
It's a daunting task for sure but doable because the space is highly finite (nothing like behavior for instance; and you make it finite through formalization, provided your first goal is to translate e.g. business knowledge, not Shakespeare). It's also a one-off thing because then you may just iterate (refine) or fork, if the basis is sound enough.
I know it all sounds sci-fi but having looked at the problem from many angles, I've seen the PoC for every step (notably linguistics software before neural nets was really interesting, producing topological graphs in n dimensions of concepts e.g. by association). I'm pretty sure that's the future paradigm of "information encoding" and subsequent decoding, expression.
It's just really big, like telling people in the 1950's that because of this IBM thing, eventually everybody will have to get up to speed like it's 1990 already. But some people "knew", as in seeing the "possible" and even "likely". These were the ones who went on to make those techs and products.
Legal technology. Pretty much everything a lawyer submits to a court is in PDF, or is physically mailed and then scanned in as PDF. If you want to build any technology that understands the law, you have to understand PDFs.
Changing their process might be more expensive than paying a lot of money for them to carry on as is for a few more years while getting the benefit of modern eyes on their content.
Edit: concrete example would be government publications like budget narrative documents.
PDFs are incredibly flexible. Text can be specified in a bunch of ways. Glyphs can be defined to the nth degree. Text sometimes isn’t text at all. There’s no layout engine and everything is absolutely positioned. Fonts in PDFs are insane because they’re often subset so they only include the required glyphs, and the characters are remapped to 1, 2, 3, etc. instead of the usual ASCII codes.
I've actually seen obfuscation used in a PDF where they load in a custom font that changes the character mapping, so the text you get out of the PDF is gibberish, but the fonts displayed on rendering are correct (a simple character substitution cipher).
The important thing to remember whenever you think something should be simple, is that someone somewhere has a business need for it to be more complicated, so you'll likely have to deal with that complication at some point.
If you care to check us out: https://siftrics.com/
I've often thought about creating products like these but as a one-man operation I am daunted by the "getting customers" part of the endeavour. How do you get a product like this into the hands of people who make the decisions in a business? (For anyone, not just OP). PPC AdWords campaigns? Cold-calling? Networking your ass off? Pay someone? Basically, how does one solve the "discoverability problem"?
In the meantime, if you have any questions, feel free to send me an email at firstname.lastname@example.org. I’d love to hop on the phone or do a Zoom meeting or a Google Hangouts.
Feel free to email me at email@example.com with any questions. We can set up a phone call, Zoom meeting, or Google Hangouts too, if you’d like.
The page is clear and easy to understand, looks good. Well done.
I like to say that anyone with a good old ThinkPad and an internet connection can mint fortunes and build empires :-)
I am especially thinking about Japanese. Our company could probably find good uses of such service if it had Japanese support.
If you have any questions or need help trying it out, please email me at firstname.lastname@example.org. We can hop on the phone, too, if you'd like.
Of course, there is absolutely no semantics, just display.
I would believe that. It was a pretty poor obfuscation method as they go, if it was intended for that.
> Believe it or not, you can often rebuild the cmaps using information in the pdf to fix the mapping and make the extraction work again.
Oh, I did. That's the flip side of my second paragraph above. When there's a business need to work around complications or obfuscations, that will also happen. :)
Can't stress this enough. The next time you open a multi-column PDF in adobe reader and it selects a set of lines or a paragraph in the way you would expect, know that there is a huge amount of technology going on behind the scenes trying to figure out the start and end of each line and paragraph.
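To make that concrete, here is a deliberately naive sketch of one such heuristic: finding column gutters by scanning for vertical whitespace. The (x0, x1, text) word boxes are hypothetical, not any particular library's output, and as noted elsewhere in this thread, justified two-column text is exactly where this starts to fall apart:

    def split_into_columns(words, min_gap=20):
        """words: list of (x0, x1, text) boxes; returns a list of columns."""
        if not words:
            return []
        left = int(min(w[0] for w in words))
        right = int(max(w[1] for w in words))
        covered = [False] * (right - left)
        for x0, x1, _ in words:
            for x in range(int(x0) - left, int(x1) - left):
                covered[x] = True
        # Centres of uncovered runs wide enough to look like a column gutter.
        gaps, run = [], None
        for i, c in enumerate(covered + [True]):
            if not c and run is None:
                run = i
            elif c and run is not None:
                if i - run >= min_gap:
                    gaps.append(left + (run + i) // 2)
                run = None
        # Bucket each word by how many gutters lie to its left.
        columns = [[] for _ in range(len(gaps) + 1)]
        for w in words:
            columns[sum(1 for g in gaps if w[0] >= g)].append(w)
        return columns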
Converting PDFs to HTML well is a very hard problem, but being hard is not by itself enough to create a very big company. When processing PDFs or documents generally, the value is not in the format, it's in the substantive content.
The real money is not going from PDF to HTML, but from HTML (or any doc format) into structured knowledge. There are plenty of companies trying to do this (including mine! www.docketalarm.com), and I agree it has the potential to be as big as self-driving cars. However, technology to understand human language and ideas is not nearly as well developed as technology to understand images, video, and radar (what self-driving cars rely on).
The problem is much more difficult to solve than building safer-than-human self-driving cars. If you can build a machine that truly understands text, you have built a general AI.
There's a lot more than zero dollars going into this... it's just that the end result is universally something that's "good enough for this one use-case for this one company" and that's as far as it gets.
The basic issue imho is that NLP algorithms are very inaccurate even with perfect input. E.g. even with perfect input, they're maybe only 75% accurate. And even a text-processing algorithm that's like 99.9% accurate will yield input to your NLP algorithms that's like 50% accurate (per-character accuracy compounds, so a passage of several hundred characters has a good chance of containing at least one error), so any results will be mostly unusable.
At least :)
Can you explain a bit more about why this is so valuable? I don't know anything about this industry.
Why wouldn't you just zoom with the center point being where the tap occurred?
- Use https://github.com/flexpaper/pdf2json to convert the PDF into an array of (x, y, text) tuples (rough sketch after this list)
- Use a good text parsing library. Regexes are probably not enough for your use case. In case you are not aware of the limitations of regexes, you may want to learn about the Chomsky hierarchy of formal languages.
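To illustrate the first step, here is a rough sketch of turning positioned fragments into reading-order lines. The tuple shape is the hypothetical (x, y, text) described above, not pdf2json's actual JSON schema:

    from collections import defaultdict

    def tuples_to_lines(fragments, y_tolerance=2.0):
        """fragments: iterable of (x, y, text) tuples; returns one string per line."""
        lines = defaultdict(list)
        for x, y, text in fragments:
            # Snap y into a bucket so fragments on the same baseline group together.
            lines[round(y / y_tolerance)].append((x, text))
        out = []
        # Note: PDF's origin is bottom-left, so depending on the tool you may need
        # to sort buckets in reverse to get top-to-bottom reading order.
        for key in sorted(lines):
            row = sorted(lines[key])
            out.append(" ".join(text for _, text in row))
        return out

    print("\n".join(tuples_to_lines([(72, 100, "Invoice"), (200, 100, "#1234"),
                                     (72, 120, "Total"), (200, 120, "$99.00")])))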
Here is the section of our Dockerfile that builds pdf2json for those of you that might need it:
# Download and install pdf2json
RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
&& cd $HOME/pdf2json-$PDF2JSON_VERSION \
&& wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF... \
&& tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
&& ./configure > /dev/null 2>&1 \
&& make > /dev/null 2>&1 \
&& make install > /dev/null \
&& rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION
Regexes have limitations but I was able to leverage them sufficiently for PDFs from a single source.
I parsed over 1 million PDFs that had a fairly complex layout using Apache PDFBox and wrote about it here: https://www.robinhowlett.com/blog/2019/11/29/parsing-structu...
Bounding boxes can also be off with pdf2json. Pdf.js does a better job but has a tendency to not handle some ligatures/glyphs well, sometimes transforming a word like "finish" into "f nish" (eating the "i" in this case). pdfminer (Python) is the best solution yet, but a thousand times slower....
Most programming languages offer a regex engine capable of matching non-regular languages. I agree though, if you are actually trying to _parse_ text then a regex is not the right tool. It just depends on your use case.
I have used this in the past to extract tables, but it doesn't help much in cases where you need font size information.
pdftotext -layout -nopgbrk -eol unix -f $firstpage -l $lastpage -y 58 -x 0 -H 741 -W 596 "$FILE"
A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
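A bare-bones sketch of that kind of column splitting, assuming pdftotext -layout output and a hand-maintained list of cut positions per page (which is roughly what the marker annotation amounted to):

    def split_columns(layout_text, cuts):
        """layout_text: one page of `pdftotext -layout` output.
        cuts: character offsets where each column begins, e.g. [0, 20, 28]."""
        rows = []
        for line in layout_text.splitlines():
            cells = [line[a:b].strip() for a, b in zip(cuts, cuts[1:] + [None])]
            rows.append(cells)
        return rows

    # The cut positions have to be re-derived (or hand-annotated) per page when
    # the column widths drift from page to page.
    page = (f"{'NAME':<20}{'ID':<8}DESCRIPTION\n"
            f"{'foo_field':<20}{'17':<8}A field that does foo things")
    for row in split_columns(page, [0, 20, 28]):
        print(row)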
As someone else said, this is like black magic, but kind of fun :)
In the GNU Awk User's Guide:
Tracking column and field widths across page breaks is ... interesting, but more tractable.
I've come across most of the problems in this post but the most memorable thing was when we were asked to support Arabic, when suddenly all your previous assumptions are backwards!
We need some metadata to rearrange and sort PDF pages for mailing and delivery (such as name, address, and start/end page for that customer).
Our general rule is you provide metadata in an external file to make it easy for us. Otherwise, we run pdftotext and hope there's a consistent formatting for the output (e.g. every first page has "Issue Date:", "Dear XYZ,", or something written on it).
If that doesn't work then we're re-negotiating. Usually it is not too difficult to build a parser for one family of PDF files based on a common setup, as you've said, and you get to learn various tricks. It is very difficult, though, to write a general parser.
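For what it's worth, the "run pdftotext and hope" step is roughly this (the "Issue Date:" field name is just an example, and it assumes the pdftotext binary is installed):

    import re
    import subprocess

    def first_page_issue_date(pdf_path):
        # Extract only page 1 and write it to stdout ("-").
        text = subprocess.run(
            ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"Issue Date:\s*(\S+)", text)
        return match.group(1) if match else None

    print(first_page_issue_date("statement.pdf"))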
Personally, I found parsing postscript easier since usually it was presented linearly.
The good thing about this was, as you have already outlined: it allowed for some flexibility in what was acceptable input data. For specific address formats or names we could accept multiple formats as long as they were consistent and in the proper position in the input file.
Regarding renegotiating: We didn't get that far. However, if a customer within our organization was enlisting our expertise and could not produce an acceptable input file, then we would go back to them and explain the format that we require in order to generate the necessary documents. Of course, creating our document through our data pipelines is obviously the better choice, but this was not an option in some cases at the time.
As far as doing the work of creating these documents in a tool like Planetpress is concerned, well, don't use Planetpress. You are better off doing it with your favorite language of choice's libraries tbh. Nothing worse than having to use proprietary code (Presstalk/Postscript) that you have to learn and never be able to use anywhere else.
The problem we have with a lot of client files is that they look fine but printers don't care about "look fine", they crash hard when they run out of virtual memory due to poor structure. And usually without a helpful error message, so that's more billable hours to diagnose. The most common culprit is workflows that develop single document PDFs then merge them resulting in thousands of similar and highly redundant subset fonts.
Also, how do you manage things when one of those banks decides to change the layout/format?
The experience helped me to roll out an API for developers, hosted at https://extracttable.com.
OCR tricks? Assuming post-processing dev stuff: may I know your OCR engine? We are supported with Kofax and OpenText, along with cloud engines like GVision as a backup.
The PDF standard is a mess, and the number of 'tricks' I've seen done is astonishing.
Example: to add a shade or border effect to text, most PDF generators simply add the text twice with a subtle offset and different colors. Result: your SaaS service returns every sentence twice.
Of course there were workarounds, but at some point it became unmaintainable.
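One such workaround, sketched with hypothetical (x, y, text) fragments: drop any fragment whose text duplicates another fragment within a tiny offset, which is what the shadow copies look like once extracted:

    def drop_shadow_copies(fragments, tolerance=1.5):
        kept = []
        for x, y, text in fragments:
            is_shadow = any(
                text == t and abs(x - kx) <= tolerance and abs(y - ky) <= tolerance
                for kx, ky, t in kept
            )
            if not is_shadow:
                kept.append((x, y, text))
        return kept

    # The second fragment is the offset "shadow" copy and gets dropped.
    print(drop_shadow_copies([(100, 200, "Total due"), (100.6, 200.6, "Total due"),
                              (100, 220, "$42.00")]))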
The demand for automating text extraction is still very high — or at least it feels like it when you’re working around the clock to cater to 3 of your customers, only to wake up to 10 more the next day. We’re small but growing extremely quickly.
I've bookmarked your site for future research... but the aviation part has me curious!
It’s definitely harder to get government business because the sales process is so long and compliance is so stringent. That said, we are GDPR compliant.
Parsing statement PDFs from every bank is pretty hellish.
In the past we did purposely make it more difficult to parse our PDFs
I managed to find your email address from your GitHub profile. Going to send you an old fashioned email.
# Parser drift and maintenance hell
Let's say that you receive 100 invoices a month from a company over the course of 3 months. You look over a handful of examples, pick features that appear to be invariant, and determine your parsing approach. You build your parser. You're associating charges from tables with the sections they're declared in, and possibly making some kind of classification to make sure everything is adding up right. It works for the one or two example pdfs you were building against. It goes live.
You get a call or bug report: it's not working. You try the new pdf they send you. It looks similar, but won't parse because it is--in fact--subtly different. It has a slightly different formatting of the phone number on the cover page, but is identical everywhere else. You change things to account for that. You retest your examples, they break. Ok, two different formats, same month, same supplier. You fix it. Chekhov's Gun has been planted.
A month passes, it breaks. You inspect the offending pdf. Someone racked up enough charges they no longer fit on a page. You alter the parser to check the next page. Sometimes their name appears again, sometimes not, sometimes their next page is 300 pages away. It works again.
A few more months later, a sense of deja-vu starts to set in. Didn't I fix this already? You start tracking three pdfs across 3 months:
pdf 1 : a -> b -> c (Starts with format a, changes to be the same as pdf 2, then changes again)
pdf 2 : b -> b -> c (Starts with one format, stays the same, then changes the same way as pdf 1)
pdf 3 : b -> a -> b (Starts the same as pdf 2, changes to pdf 1's original format, then changes back)
What's the common factor between these version changes? The return address is determining the version.
PDFs are slightly different from office to office, with templates drifting slightly each month in diverging directions. You have to start reevaluating parsing choices and splitting up parsers. It's difficult to account for incurring linear maintenance cost for each new supplier and amortize that over a sizeable period of time. My arch nemesis is an intern who got put to work fixing the invoices at one office of one foreign supplier.
# PDFs that aren't standards compliant
In this case, most pdf processing libraries will bail out. Pdf viewers on the other hand will silently ignore some corrupted or malformed data. I remember seeing one that would be consistently off by a single bit. Something like `\setfont !2` needed to have '!' swapped out for another syntactically valid character that would leave byte offsets for the pdf unchanged.
TLDR: If you can push back, push back. Take your data in any format other than PDF if there is any way that is possible.
I have used Google's fine OCR results to simulate a hacker.
- Download a youtube video that shows how to attack a server on the website hackthebox.eu
- Run ffmpeg to convert the video to images.
- Run a jpeg to pdf tool.
- Upload the pdf to google drive.
- Download the pdf from google drive.
- Grep for the command line identifiers "$" "#".
- Connect to hackthebox.eu vpn.
- Attack the same machine in the video.
By the way, why do you wait 10 minutes? Is there a signal that the PDF is done processing?
Or is there just some kind of voodoo magic that seems to happen that just takes 10 minutes to do?
Also, I am not aware of a signal when it is done.
Now I think about it, I don't know what you mean by "upload a pdf to google drive and download it 10 minutes later".
Uploading and downloading a file shouldn't change it at all, at bit level.
ImageMagick. convert *.jpg out.pdf
It's not that the thing you're trying to do is stupid. It's probably entirely legitimate, and driven by a real need. It's just that the original designers of the thing you're trying to work on didn't give a damn about your ability to work on it.
I'm not sure (I haven't thought about it a lot) that you could come up with a format that duplicates that function and is also easier to parse or edit.
QFT. PDF should really have been called “Print Description Format”. At heart it’s really just a long list of non-linear drawing instructions for plotting font glyphs; a sort of cut-down PostScript.
(And, yes, I have done automated text extraction on raw PDF, via Python’s pdfminer. Even with library support, it is super nasty and brittle, and very document specific. Makes DOCX/XLSX parsing seem a walk in the park.)
What’s really annoying is that the PDF format is also extensible, which allows additional capabilities such as user-editable forms (XFDF) and Accessibility support.
Accessibility makes text content available as honest-to-goodness actual text, which is precisely what you want when doing text extraction. What’s good for disabled humans is good for machines too; who knew?
i.e. PDF format already offers the solution you seek. Yet you could probably count on the fingers of one hand the PDF generators that write Accessible PDF as standard.
(As for who’s to blame for that, I leave others to join up the dots.)
Currently, there is no viable alternative if you want the pros but not the cons
I remember OpenXPS being much easier to work with. That might be due to cultural rather than structural differences, mind - fewer applications generate OpenXPS, so there's fewer applications to generate them in their own special snowflake ways.
The problem with this is that from an average person perspective it doesn't have the pros. There is no built-in or first-party app that can open this format on Mac and Linux.
More than 99% of the users only want to read or print it. It's hard to convince them to use an alternative format when it's way more difficult to do the only thing they want to do.
Once you have something as a PNG (or any other format you can get into a Bitmap), throwing it against something like System.Drawing in .NET(core) is trivial. Once you are in this domain, you can do literally anything you want with that PDF. Barcodes, images, sideways text, html, OpenGL-rendered scenes, etc. It's the least stressful way I can imagine dealing with PDFs. For final delivery, we recombine the images into a PDF that simply has these as scaled 1:1 to the document. No one can tell the difference between source and destination PDF unless they look at the file size on disk.
This approach is non-ideal if minimal document size is a concern and you can't deal with the PNG bloat compared to native PDF. It is also problematic if you would like to perform text extraction. We use this technique for documents that are ultimately printed, emailed to customers, or submitted to long-term storage systems (which currently get populated with scanned content anyways).
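For illustration only (the above is .NET/System.Drawing; this is not that pipeline), the same rasterize-then-reassemble idea sketched with poppler's pdftoppm and ImageMagick called from Python:

    import glob
    import subprocess

    def flatten_pdf(src, dst, dpi=150):
        # Rasterize every page to PNG (pdftoppm numbers the output pages).
        subprocess.run(["pdftoppm", "-png", "-r", str(dpi), src, "page"], check=True)
        # Stitch the page images back into a single image-only PDF.
        # (ImageMagick 7 users may need "magick" instead of "convert".)
        pages = sorted(glob.glob("page-*.png"))
        subprocess.run(["convert", *pages, dst], check=True)

    flatten_pdf("input.pdf", "flattened.pdf")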
pdftk form.pdf multibackground additions.pdf output output.pdf
Not even when they try to select and copy text?
From my experience, it seems to grab text just fine, the tricky part is identifying & grabbing what you want, and ignoring what you don't want... (for reasons mentioned in the article)
Sounds like you are in need of OCR if you want to be able to use arbitrary screen coords as a lookup constraint.
Epubs in comparison are easy, as all it takes is a single tap or button press to continue. When there's no DRM on the file (thanks HB, Baen) I read in FBReader with custom fonts, colors, and text size. It doesn't hurt any that the epub files I get are usually smaller than the PDF version of the same book.
Personally, I think the fact that Calibre's format converter has so many device templates for PDF conversion says a lot.
I recently read an SEO post on Okta's site. Who can read that garbage?
 only GA ... which isn't a 3rd-party tracker.
Why not? It's not self-hosted and results are stored elsewhere.
Some might add: for the purpose of resale of the data, but I don't think that's a requirement to be classified as a 3rd-party tracker. The mere act of correlation, no matter what you then do with the data, makes you a 3rd-party tracker. In case you think that's just semantics, this is important for GDPR and the new California law.
You can turn on the "doubleclick" option, which does do said correlation and tracks you. But that's up to the site to decide. GA doesn't do it by default.
When you're done filling the form, the PDF runs form validity checks and generates a 2D barcode -- which stores all your field entry data -- on the first page. This 2D barcode can then be digitally extracted on the receiving end with either a 2D barcode scanner or a computer algorithm. No loss of fidelity.
Looks like Acrobat supports generation of QR, PDF417 and Data Matrix 2D barcodes.
The Canadian tax agency offers free storage for whatever receipts you mail them? Sounds nifty. Does the IRS (or any other tax agency) do this?
As noted in the article, it is extremely difficult to figure out the original text given only a "normal" PDF, so you end up using a lot of heuristics that sometimes guess correctly. There's no guarantee that you'll be able to extract the "original text" when you start with an arbitrary PDF without embedded data. So if you're extracting text, neither way guarantees that you'll get "original text" that exactly matches the displayed PDF if an attacker created the PDF.
That said, there's more you can do if you have an embedded OpenDocument file. For example, you could OCR the displayed PDF, and then show the differences with the embedded file. In some cases you could even regenerate the displayed PDF & do a comparison. There are lots of advantages when you have the embedded data.
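A toy sketch of that comparison using difflib, with placeholder strings standing in for the OCR output and the text pulled from the embedded file:

    import difflib

    embedded_text = "Pay to the order of: Alice\nAmount: $100.00\n"
    ocr_text = "Pay to the order of: Alice\nAmount: $1,000.00\n"

    # Show where the rendered (OCR'd) page disagrees with the embedded data.
    diff = difflib.unified_diff(
        embedded_text.splitlines(), ocr_text.splitlines(),
        fromfile="embedded", tofile="rendered (OCR)", lineterm="",
    )
    print("\n".join(diff))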
Is there a special "data" section of the PDF that includes this?
Can you point me to any documentation regarding this? It sounds quite good TBH.
Using white-on-white dark-hat SEO techniques for keyword boosting? Check. Custom fonts with random glyphs? Check. I didn't see custom encodings (yet).
We try to keep HTML semantic, but Google has been interpreting pages at a much higher level in order to spot issues such as these. If you have ever tried to work on a scraper, you know it's very hard to get far nowadays without using a full-blown browser as a backend.
What worries me is that it's going to get massively worse. Despite me hating HTML/web interfaces, one big advantage for me is that everything which looks like text is normally selectable, as opposed to a standard native widget which isn't. It's just much more "usable", as a user, because everything you see can be manipulated.
We've already seen asm.js-based dynamic text layout, inspired by TeX, with canvas rendering that has no selectable content and/or suffers from all the OP's issues! Now, make it fast and popular with WASM...
Though absolute-positioning of all text elements via CSS at some arbitrary level (I've seen it by paragraph), such that source order has no relationship to display order, is quite close.
I started reading the ISO spec for postscript used in modern PDFs. You can read it yourself here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD...
What actually needs to be done to extract text correctly is to parse the postscript and have a way of figuring out how the raw text, or the curves that draw the text, are displayed (whether they are displayed at all, and in relation to each other), using the information that the postscript gives you.
Edit: More than anything I think understanding deeply the class of PDFs you want to extract data from is the most important part. Trying to generalize it is where the real difficulty comes from.. as in most things.
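As a starting point for that kind of analysis, pdfminer.six (mentioned in other comments here) will hand you text runs with their bounding boxes, which is the raw material any layout heuristic has to work from. A minimal sketch, with a made-up filename:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages("paper.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in page coordinates.
                x0, y0, x1, y1 = element.bbox
                text = element.get_text().strip()
                print(f"({x0:.0f}, {y0:.0f})-({x1:.0f}, {y1:.0f}): {text!r}")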
I did not get time to enhance it further, but I'm planning to containerize the whole application. See if you find it useful in its current form.
Not a fan of the potential vendor lock in though, so it's only really suitable for those in an already AWS environment not worried about them harvesting your data.
* Polish ebooks, which usually use Watermarks instead of DRM, sometimes hide their watermarks in a weird way the screen reader doesn't detect. Imagine hearing "This copy belongs to address at example dot com: one one three a f six nine c c" at the end of every page. Of course the hex string is usually much longer, about 32 chars long or so.
* Some tests I had to take included automatically generated alt texts for their images. The alt text contained full paths to the JPG files on the designer's hard drive. For example, there was one exercise where we were supposed to identify a building. Normally, it would be completely inaccessible, but the alt was something like "C:\Documents and Settings\Aneczka\Confidential\tests 20xx\history\colosseum.jpg".
* My German textbook had a few conversations between authors or editors in random places. They weren't visible, but my screen reader still could read them. I guess they used the PDF or Indesign project files themselves as a dirty workaround for the lack of a chat / notetaking app, kind of like programmers sometimes do with comments. They probably thought they were the only ones that will ever read them. They were mostly right, as the file was meant for printing, and I was probably the only one who managed to get an electronic copy.
* Some big companies, mostly carriers, sometimes give you contract templates. They let you familiarize yourself with the terms before you decide to sign, in which case they ask you for all the necessary personal info and give you a real contract. Sometimes though, they're quite lazy, and the template contracts are actually real contracts. The personal data of people that they were meant for is all there, just visually covered, usually by making it white on white, or by putting a rectangle object that covers them. Of course, for a screen reader, this makes no difference, and the data are still there.
Similar issues happen on websites, mostly with cookie banners, which are supposed to cover the whole site and make it impossible to use before closing. However, for a screen reader, they sometimes appear at the very beginning or end of the page, and interacting with the site is possible without even realizing they're there.
With that in consideration, and given that the existing resources are little help, especially with skewed, blurry, or handwritten input, or two different table structures in the same input, I ended up creating an API service to extract tabular data from images and PDFs, hosted at https://extracttable.com . We built it to be robust; the average extraction time on images is under 5 seconds. On top of maintaining accuracy, a bad extraction is eligible for a credit usage refund, which literally no other service offers.
I invite HN users to give it a try, and feel free to email email@example.com for extra API credits for the trial.
Feel free to reach me at manuel at jazzido dot com
Am I reading the repos correctly? It looks like Extractable copied Tabula (MIT) to its own repo rather than forking it, removed the attribution, and then tried to re-license it as Apache 2.0. If so, that would be pretty fucked up.
Still, I would have loved at least a heads up from the team that sells Tabula Pro. I know they're not required to do so, but hey, they're kinda piggybacking on Tabula's "reputation".
What do you recommend we do, so you don't feel we made a dick move?
Did you ask permission of the original author to use a derived name?
Did you discuss your plan to commercialize the original author's work with the author? Before starting out?
Since starting a commercial project, how much money have you given to the original author?
"commercialize the original author's work with the author"
- No, but let me highlight this: any extraction with tabula-py is not commercialized. You can look into the wrapper too :) or even compare the results of tabula-py vs TabulaPro.
Copying the TabulaPro description here, "TabulaPro is a layer on tabula-py library to extract tables from Scan PDFs and Images." - we respect every effort of the contributors & author, never intended to plagiarize.
I understand the misinterpretation here is that we are charging for the open-sourced library because of the name. We already informed the author in the email about unpublishing the library this morning; I just deleted the project and came here to mention it is deleted :)
It may be that you didn't realize that people would see your appropriation as wrong, although I have a hard time believing that as well given that the author tried to contact you and was ignored. As they say, "The wicked flee when no man pursueth."
So what I see here is somebody knowingly doing something dodgy and then panicking when getting caught. If you'd really like to make amends, I'd start with some serious introspection on what you actually did, and an honest conversation with the original author that hopefully includes a proper apology.
 Meaning it includes an explicit recognition of your error and the harms done, a clear expression of regret, and a sincere offer to make amends. E.g., https://greatergood.berkeley.edu/article/item/the_three_part...
I would like to get the author's comment on "tried to contact you and was ignored", as I was the one who emailed yesterday.
You're "handwritten" example looks a bit "too decent" as well. I can see how that works. You first look for the edges of the table, and then you evaluate the symbol in each cell as something that matches unicode.
So, how well does this cope with increasing degradation? i.e. pencil written notes that bleed outside cell borders, curve around borders, etc.? Stamps and symbols (watermarks) across tables?
"The Worst Image" is a close match to that, except it is a print.
Regarding increasing degradation - as stated above, the OCR engine is not proprietary - we confined ourselves to detecting the structure at this moment, and started with the most common problems.
And meanwhile if you say on HN that HTML should be used instead of PDF for papers, people will jump on you insisting that they need PDF for precise formatting of their papers—which mostly barely differ from Markdown by having two columns and formulas. What exactly they need ‘precise formatting’ for, and why it can't be solved with MathML and image fallback, they can't say.
People feeling the urge to defend PDF might want to pick up at this point in the discussion: https://news.ycombinator.com/item?id=21454636
PDF is simply better than HTML when it comes to preserving layout as the author intended.
First, there's no need to stretch my argument to the point of it being ridiculous. I don't have to reach for a watch to suffer from PDF. Even a tablet is enough: I don't see many 14" tablets flying off the shelves. I also know for sure that the vast majority of ubiquitous communicator devices, aka smartphones, are about the same size as mine, so everyone with those is guaranteed to have the same shitty experience with papers on their communicators and will have to sedentary-lifestyle their ass off in front of a big display for barely any reason.
> Likewise, when layout makes the difference between reader understanding or confusion, it's hard to trust automatic reflowing on unknown screen sizes. PDF is simply better than HTML when it comes to preserving layout as the author intended.
As I wrote right there above, still no explanation of why the layout makes that difference and why preserving it is so important when most papers are just walls of text + images + some formulas. Somehow I'm able to read those very things off HTML pages just fine.
I think it's the artist in them.
This is understandable in some instances:
a. Picasso's or Monet's works probably wouldn't be as good if you just roll them up into a ball. Sure, the component parts are still there (it's just paper/canvas and paint after all!) but the result isn't what they intended.
b. A car that has hit a tree is made up of the composite parts but isn't quite as appealing (or useful) as the car before hitting the tree.
c. A wedding cake doesn't look as good if the ingredients are just thrown over the wedding party's table. The ingredients are there, but it just isn't the same...
Presentation is sometimes important.
This conjecture would have some practical relevance if I had access to the same papers in other formats, preferably HTML. Yet I'm saddened time and again to find that I don't.
In fact, producing HTML or PDF from the same source was exactly my proposed route before I was told that apparently TeX is only good for printing or PDFs. I hope that this is false, but I'm not in a position to argue currently.
It is worrying if places that are “libraries” of knowledge aren’t taking the opportunity to keep searchable/parseable data, but it’s no worse than a library of books.
That's not my complaint in the first place. The problem is that while we progressed beyond books on the device side in terms of even just the viewport, we seemingly can't move past the letter-sized paged format. The format may be a bit better than books—what with it being easily distributed and with occasionally copyable text—but not enough so.
I'm not even touching the topic of info extraction here, since it's pretty hard on its own and despite it also being better with HTML.
The key difference is that it contains no ambiguity (for example, all fonts must be embedded).
You're right that most of the relevant semantics would fit into Markdown. So store the markdown! There are problems with PDF but HTML is the worst of all worlds.
If you're talking about images and whatnot falling off, that's a problem of delivery and not the format.
Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.
I am indeed, and of other tags that are no longer supported. Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact. And while the oldest generation of sites embraced the reflowing of HTML, the CSS2-era sites did not, so it's not at all clear that they will be usable on different-resolution screens in the future.
> Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.
This is one of those things that sounds easy but is impossible in practice. Unless you can clearly define the line between which features should be used and which should not, you'll end up with all of the features of HTML being used, and all of the problems that result.
You'll notice that I said in the top-level comment that my beef is with PDF papers (i.e. scientific and tech). I don't care about magazines and such, since they obviously have different requirements. So let's transfer your argument to current papers publishing:
You can decide for yourself whether this corresponds to reality, and if the hijinks of CSS2-era websites are relevant.
What is it with people, one after another, jumping to the same argument of ‘if authors have HTML, they will immediately go bonkers’? It really looks like some Freudian transfer of innate tendencies. We have Epub, for chrissake, which is zipped HTML—what, have Epub books gone full Dada while I wasn't looking? Most trouble with Epub that I've had is inconvenience with preformatted code.
> Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact.
Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.
Those things don't happen in PDFs in the wild, or at least not to any great extent. It's not that technical paper authors have shown some special restraint and limited themselves to a subset of what the rest of the PDF world does. Technical papers look much like any other PDF and do much the same thing that any other PDF does; if they were authored in HTML, we should expect them to look much like any other HTML page and do much the same thing that any other HTML page does. Based on my experience of HTML pages, that would be a massive regression.
> Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.
The details matter; you can't just handwave the idea of a sensible set of restrictions and a good packaging format, because it turns out those concepts mean very different things to different people. If you want to talk about, say, Epub, then we can potentially have a productive conversation about how practical it is to format papers adequately in the Epub subset of CSS and how useful Epub reflowing is versus how often a document doesn't render correctly in a given Epub reader. If all you can say about your proposal is "a subset of HTML" then of course people will assume you're proposing to use the same kind of HTML found on the web, because that's the most common example of what "a subset of HTML" looks like.
This makes zero sense to me. You're saying that technical papers look the same as Principia Discordia or glamor/design magazines or advertising booklets, including those that just archive printed media. That technical papers include web-form functionality just like some PDFs do—advertising or whatnot, I'm not sure. If that's the reality for you then truly I would abhor living in it—guess I'm relatively lucky here in my world.
However, if you point me to where such papers hang out, I would at least finally learn what mysterious ‘complex formatting’ people want in papers and which can only be satisfied by PDF.
We need the opposite, we need a format that stays the same size, same proportions and is vectorized so you can zoom to any size - however, the relationship of space between elements remains constant.
PDF is an amazing format IMO. Think of it like Docker - the designer knows exactly how it's going to appear on the user's device.
If you have opposing thoughts, please elaborate further instead of simply saying "nothing to do with it". HN works by explaining and arguing about issues to get to the bottom of something.
Sorta like this then, more rigidly conveying the designer's intended layout: https://bureau.rocks/projects/book-typography-en/ ?
(Though these ‘books’ are almost the opposite of what I'm advocating for, in terms of formatting.)
Statements of the form "So you're saying [obviously stupid thing]?" still break the site guidelines, though.
> I don't think that's a problem in HTML. I think the problem is in your head.
This directly addresses fermienrico's complaint as being targeted at HTML. The ‘in your head’ part says that the problem is imaginary and that the person's reasoning and motivation in arriving at that connection is a mystery to me. So there are two meanings combined, both of which aren't ad hominems. The statement also deliberately invokes the words of the character of Preobrazhensky from Bulgakov's ‘Heart of a Dog’, on the condition of the fresh-born Soviet Union: “The Disruption isn't in the lavatories, it's in the heads”.
Let's inspect the statement more closely. The use of the second-person ‘you’ in hypothetical constructs is ubiquitous in English, instead of a third-person ‘a man’ or ‘one’, e.g.: “When you try reading PDF on a phone, you experience unspeakable horror and loss of all hope”—this doesn't imply that the addressed correspondent is the one doing this, and in fact may be directed at multiple unknown readers or listeners.
The hyperbole of ‘you're nuts’ is, to my knowledge, also a typical feature of colloquial English language, e.g.: “You must be crazy to try reading PDF on a phone”, or “What's wrong with you, that tablet is too small to display PDF adequately”. These both don't mean that the correspondent is literally mentally damaged, but that the speaking party doesn't understand their reasoning or doesn't agree with it.
My choice of words there is rather harsh, yes. Why I needed that is, I've had this same discussion before and I repeatedly failed to extract from people the reasons why they make this inference. I tried different approaches, and now came the time of directly placing the person in the shoes of an author. Still nothing so far.
On top of all this and nitpicking further, even disregarding the above I still can't quite fit the statement in a ‘personal attack’ category: as I understand it, an ‘ad hominem’ works by carrying a belief ‘the person has some bad quality A’ over to ‘the person's opinion B hence must be wrong’. In the case of ‘you must be insane/naive to have the opinion B’ no other personal qualities are involved, and the opinion B is directly stated to be wrong. Possibly uncivil and possibly unsupported yes, personal no.
P.S. Could someone please make the ‘collapse comments’ button-link larger on phones? It's even worse on a higher-PPI display, easily taking a dozen attempts to hit it—and extensions like Stylus aren't readily available on phones, what with Firefox dumping them in Preview. Just making the link a dozen characters wide would be splendid.
It looks promising for these kinds of daunting tasks
And additional follow-up work on extracting data from PDF datasheets is here:
One thing to point out about our library is that while we do take PDF as input and use it to calculate visual features, we also rely on an HTML representation of the PDF for structural cues. In our pipeline this is typically done by using Adobe Acrobat to generate an HTML representation for each input PDF.
Most of our Xerox printers spoke Postscript natively, these days more printers can use PDF. We generally used a tool to convert PCL to PS to suit our workflow if that was the only option for the file, because being able to manipulate the file (reordering and applying barcodes or minor text modifications) was important. Likewise for AFP and other formats. PCL jobs were rare so I never worked on them personally.
If someone could make a service that lets you upload a PDF that contains a form, and then let users fill out that form and e-sign it and collect the results, and then print them out all at once, it would be great.
It's not a billion dollar idea but there are a lot of little companies that would save a lot of time using it.
* JotForm (https://www.jotform.com/help/433-How-to-Add-an-E-Signature-t...)
(I know about all these because I'm working on a PDF generation service for developers called DocSpring. I'm also working on e-signature support, but that's still under development, and still won't be a perfect fit for your use-case.)
It lets me type in to forms - or draw text over them if necessary. Then I paste in a scan of my signature. Then save as a PDF an email across.
I've been doing this for years. Job applications, mortgages, medical questionnaires. No one has ever queried it.
If you're hand delivering a printed PDF, it's just going to be copy-typed by a human into a computer. No need to make it too fancy.
In theory it sounds like it should be straightforward but it hinges so much on how well the document is structured underneath the surface.
Because these tools were primarily designed for non-technical users, the priority is the visual and printed outcome, not the underlying structure.
One document can look much the same as another in form—uses black borders to outline fields, similar or same field names, etc, but may be structured entirely differently and that can be a madhouse of frustrating problems.
It can be complex enough to write a solution for one specific document source. Writing a universal tool that could take in any form like that would probably be a pretty decent moneymaker.
My first intuition, though, would be that it may be more successful (though no simpler) to develop a model that can read from the visual of the document rather than parsing its underlying structure.
Open to learning something here, though!
There is a wide world outside of consumers of SaaS products for every little niche problem.
Sometimes they are baked in processes that still use PDF's to share information, sometimes they're old forms of any kind, sometimes even old scanned docs that are still in use but shared digitally. A lot of the businesses that carry on that way are of the mind that "if it's not broke, don't fix it" which is quite rational for their problem areas and existing knowledge base. They might be a potential market at some point for a new solution, but good luck selling them on a web-based subscription SaaS solution when a simple form has been serving their needs for 30+ years.
OP's problem of the PDF being the go-between to digital endpoints is more common than you might think.
The universality I was referring to was the wide range of possibilities for how a given form might be laid out. And old documents contain a lot of noise when they've been added to or manipulated. Look inside an old PDF form from some small-medium sized business sometime. Now imagine 1000 variations of that form for one standard problem. Then multiply that by the number of potential problem areas the forms are managing.
Also like OP said—it's not sexy, but it's very real and having an intelligent PDF form reader and consumer would be a time-saver for those businesses who aren't geared to completely alter their workflow.
The tool could do anything with the extracted data, if it allowed you to connect to any of your in-house services (like payroll or accounting), either with a quick config/API or a custom patch, or to Google Drive, or whatever, without complications like requiring an online connection and web accounts. No whole solution like that exists to my knowledge. At least nothing accessible to the wider market.
Thanks again, I just want to make sure I understand.
OTOH, a PDF form works exactly the way you’d like. Maybe there’s a small market in helping convert one to the other for collecting input from old paper-ish forms.
Again, not sexy, but it is so stupid I have to fill out a direct deposit form by hand and turn it into my company, who checks it, then hands it off to the payroll vendor, who has to check it, just to enter the damn data into a form on their end.
Sorry, this is a bit off-topic regarding PDF extraction, but it distracted me greatly while reading...
I'm pretty sure the intention was A B C D (cut then wash). Not sure why the author would not use alphabet order for the recipe...
Sorry, I had a colleague read it and he mentioned that the A B C D annotations were probably not in the original document. This was not clear at all to me while reading, and if they are not included, it's indeed hard to find the correct paragraph order.
And of course, even if the letters were there in the original document, it would be clear to a human that they're incorrect because it doesn't make sense to wash vegetables after cutting.