Hacker News: robinhowlett's comments

This looks excellent - I've been looking for a timeline-based, simple-entry tool like this. Well done.


Thanks for the links - agree about the (x,y,text) callout but other metadata like font size can be useful too.

Regexes have limitations, but I was able to leverage them sufficiently for PDFs from a single source.

I parsed over 1 million PDFs that had a fairly complex layout using Apache PDFBox and wrote about it here: https://www.robinhowlett.com/blog/2019/11/29/parsing-structu...


I thoroughly enjoyed both the blog post (an accessible but thorough explanation of your experience with PDF data extraction) and the linked news article [0]. The latter is an all-too-familiar story: a company realizes that a creative person is using their freely-available data in novel and exciting ways and immediately requests that they shut it down, because when faced with the perceived dichotomy of maintaining control versus encouraging progress, companies will often play it safe.

[0] https://www.thoroughbreddailynews.com/getting-from-cease-and...


Oh, yeah, pdf2json returns font sizes as well. I forgot to mention that.


pdf2json font names can sometimes be incorrect, as it only extracts them based on a preset collection of fonts. I suggest using this fork, which fixes that:

https://github.com/AXATechLab/pdf2json

Bounding boxes can also be off with pdf2json. Pdf.js does a better job but tends to handle some ligatures/glyphs poorly, sometimes transforming a word like "finish" into "f nish" (eating the i in this case). pdfminer (Python) is the best solution so far, but it is a thousand times slower...


Disclaimer: written by CEO of my company


It's funny how this area seems to attract titles like this. I used the same for my blog post, "Everything You Ever Wanted to Know About SSL (but Were Afraid to Ask)"

http://www.robinhowlett.com/blog/2016/01/05/everything-you-e...

Disclaimer: I write blog posts to figure out if I understand something correctly and appreciate any and all feedback, negative or positive.


It's a question I've asked myself too. I've also wondered how often others are being aggressively pursued by former colleagues to join them at another company. I've seen teams that strongly believe they hire only strong candidates, yet rarely seem to pursue these same people after moving to a new company.


This comment resonates more than any other. I absolutely agree with it. There was a complete demarcation in my life's focus and goals before and after getting married and, especially, having children.


I got into a similar pickle recently and did some press about it: http://www.thoroughbreddailynews.com/getting-from-cease-and-...

I'm also working on a Chrome Extension that modifies the functionality of a website, but this is giving me second thoughts.


Nice looking client. I've been tipping away at a JavaFX application of late. I'm shipping with a JVM and bundling into .app/.exe distributions. Looking forward to seeing what you did to learn from it.


Yes, I encountered similar issues, but many of them could be solved.

With PDFBox I was able to deal with the content at a very low level (on a per-character basis), so that when for instance building a String, I would insert a pipe character when the distance between adjacent characters was greater than the width of the space character and then detect that when translating to a certain field.

See the convertToText() method for an example: https://github.com/robinhowlett/chart-parser/blob/master/src...

and https://github.com/robinhowlett/chart-parser/blob/f8d651e9a1... for when I used this technique
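
To give a rough idea of the gap-detection technique without clicking through, here is a minimal sketch. It is not the code from the repository: the Glyph class, the toDelimitedText method and the idea of passing the space width in as a parameter are all hypothetical stand-ins for illustration. In PDFBox the position, width and space width would come from the per-character text position data rather than being constructed by hand.

```java
import java.util.List;

final class GapDelimiter {

    /** Hypothetical stand-in for a positioned character; not a PDFBox type. */
    static final class Glyph {
        final char ch;
        final float x;      // left edge of the character
        final float width;  // horizontal width of the character

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of a single line (assumed sorted left to right),
     * inserting '|' wherever the horizontal gap between two neighbours
     * is greater than the width of a space character.
     */
    static String toDelimitedText(List<Glyph> line, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                // Distance from the right edge of the previous glyph
                // to the left edge of the current one.
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth) {
                    sb.append('|');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }
}
```

The resulting string can then be split on the pipe (for example with split("\\|")) so that each piece maps to a field. Whether a single space width is the right threshold will depend on the layout of the PDFs in question.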


Very cool, good to see the level of control this package allows.


Author here; well, PDFBox is good for simple text stripping. If I wanted to print all the text on the PDF, that would be very straightforward and not much code. However, the PDF chart here is in essence a representation of structured data. I wanted to get the content in that format so that I could both serialize to JSON and have an SDK to boot.

