
I'm not anyone involved in this thread (so far), but I've written a minimal PDF parser in the past in somewhere between 1,500 and 2,000 lines of Go. (Sadly, it was for work, so I can't go back and check.) Granted, this was only for the bare-bones parsing of the top-level structures, and notably did not handle PostScript, so it wouldn't be nearly enough to render graphics. Despite this, it was tricky because it turns out that "following the spec" is not always clear when it comes to PDFs.

For example, I recall the spec being unclear as to whether a newline character was required after a certain element (though I don't remember which element). I processed a corpus containing thousands of PDFs to try to determine what was done in practice, and I found that about half of them included the newline and half did not. It was an emblematic issue: an unclear official "spec" meant falling back to the de facto specification, which is flexibility.
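To make "flexibility" concrete, here is a minimal sketch in Go of what a tolerant parser does at a spot like that: it accepts the end-of-line marker if present and carries on if not. This is not the code I wrote back then. The function names (skipOptionalEOL, afterKeyword) are made up for illustration, and the keyword "obj" is only a placeholder, since I don't remember which element was actually the ambiguous one.

```go
package main

import "fmt"

// skipOptionalEOL returns the offset just past a single end-of-line marker
// (CRLF, LF, or a lone CR) starting at pos. If no marker is present, it
// returns pos unchanged, so callers never fail on a missing newline.
func skipOptionalEOL(buf []byte, pos int) int {
	if pos < 0 || pos >= len(buf) {
		return pos
	}
	switch buf[pos] {
	case '\r':
		if pos+1 < len(buf) && buf[pos+1] == '\n' {
			return pos + 2 // CRLF
		}
		return pos + 1 // lone CR
	case '\n':
		return pos + 1 // LF
	}
	return pos // no EOL here; tolerate it
}

// afterKeyword checks that buf[pos:] starts with keyword and, if so, returns
// the offset where the following content begins, with or without an EOL.
// The keyword is a hypothetical stand-in for the ambiguous element.
func afterKeyword(buf []byte, pos int, keyword string) (int, bool) {
	end := pos + len(keyword)
	if pos < 0 || end > len(buf) || string(buf[pos:end]) != keyword {
		return pos, false
	}
	return skipOptionalEOL(buf, end), true
}

func main() {
	// The same keyword followed by LF, by nothing, and by CRLF.
	for _, in := range []string{"obj\n<<", "obj<<", "obj\r\n<<"} {
		next, ok := afterKeyword([]byte(in), 0, "obj")
		fmt.Printf("%q -> content starts at %d (matched=%v)\n", in, next, ok)
	}
}
```

The point of the design is that all three inputs parse to the same logical result. A strict parser that demanded the newline would have rejected roughly half of the corpus I looked at.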

It's honestly a great example of something a GPT-like system could probably handle: it's doable in a single source file if necessary, comes in at fewer than 5k lines, and can be broken into subtasks if need be.



