I thoroughly enjoyed both the blog post (an accessible but thorough explanation of your experience with PDF data extraction) and the linked news article [0]. The article is an all-too-familiar story: a company realizes that a creative person is using its freely available data in novel and exciting ways and immediately asks them to shut it down. Faced with the perceived dichotomy of maintaining control versus encouraging progress, companies will often play it safe.
pdf2json font names can sometimes be incorrect, as it only extracts them based on a pre-set collection of fonts. I suggest using this fork that fixes it:
Bounding boxes can also be off with pdf2json. Pdf.js does a better job but tends to mishandle some ligatures/glyphs, sometimes turning a word like "finish" into "f nish" (eating the "i" in this case). pdfminer (Python) is the best solution so far, but it is a thousand times slower...
It's funny how this area seems to attract titles like this. I used the same for my blog post, "Everything You Ever Wanted to Know About SSL (but Were Afraid to Ask)"
It's a question I've asked myself too. I've also wondered how often others are being aggressively pursued by former colleagues to join them at another company. I've seen teams that strongly believe they hire only strong candidates, yet their members rarely seem to pursue those same people after moving to a new company.
This comment resonates more than any other. I absolutely agree with it. There is a complete demarcation in my life's focus and goals between before and after getting married and, especially, having children.
Nice looking client. I've been tipping away at a JavaFX application of late. I'm shipping with a JVM and bundling into .app/.exe distributions. Looking forward to seeing what you did to learn from it.
Yes, I encountered similar issues, but many of them could be solved.
With PDFBox I was able to deal with the content at a very low level (on a per-character basis), so that when for instance building a String, I would insert a pipe character when the distance between adjacent characters was greater than the width of the space character and then detect that when translating to a certain field.
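To make the gap idea above concrete, here is a minimal sketch in Python rather than the Java that PDFBox itself would need. It is not PDFBox's API: the `Glyph` record, its `x` and `width` fields, and the `space_width` parameter are hypothetical stand-ins for whatever per-character position data the extraction library reports. It assumes the glyphs belong to a single line of left-to-right text and that the pipe character does not occur in the content itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical per-character record: the character, its left edge, its width."""
    char: str
    x: float
    width: float


def join_with_gap_markers(glyphs: List[Glyph], space_width: float, marker: str = "|") -> str:
    """Build a string from glyphs on one line, inserting `marker` wherever the
    horizontal gap between adjacent characters exceeds `space_width`."""
    parts = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            # Gap = left edge of this glyph minus right edge of the previous one.
            gap = g.x - (prev.x + prev.width)
            if gap > space_width:
                parts.append(marker)
        parts.append(g.char)
        prev = g
    return "".join(parts)


def split_fields(line: str, marker: str = "|") -> List[str]:
    """Split the marked-up string back into individual field values."""
    return [field.strip() for field in line.split(marker)]


# Two "columns" separated by a gap much wider than a space.
example = [Glyph("A", 0, 5), Glyph("B", 5, 5), Glyph("C", 30, 5), Glyph("D", 35, 5)]
line = join_with_gap_markers(example, space_width=3)
print(line)                # AB|CD
print(split_fields(line))  # ['AB', 'CD']
```

The threshold is the only tuning knob here. Using the width of a space follows the description above, but a real document may need a multiple of it if words within a single field are loosely spaced.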
Author here; well, PDFBox is good for simple text stripping. If I wanted to print all the text on the PDF, that would be very straightforward and not much code. However, the PDF chart here is in essence a representation of structured data. I wanted to get the content in that format so that I could both serialize to JSON plus have an SDK to boot.