This is probably a simple find-and-replace task, so I wouldn't bother with proper PDF parsing or libraries. I would:

1. Use pdftk to uncompress it: pdftk input.pdf output uncompressed.pdf uncompress

2. Look at the PDF code (it's text based) to find the image insertion code.

3. Replace every instance of the image insertion code with a string of spaces of the same length, so that nothing in the file moves (the cross-reference table at the end records each object's byte offset, and you don't want to invalidate it).

4. Use pdftk to compress it again: pdftk edited.pdf output output.pdf compress
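
Step 3 could look roughly like the Python sketch below. It is not my actual script, just a minimal illustration: the function name `blank_image_draws` and the example name `Im0` are made up, and it assumes the images are drawn with the usual `/Name Do` form and that you already know which names refer to images. It runs the regex over the whole file, so check the result if the PDF still contains binary stream data.

```python
import re

# Matches a name operand followed by the Do operator, e.g. "/Im0 Do".
# The character class excludes PDF whitespace and delimiter characters.
DO_RE = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def blank_image_draws(pdf_bytes, image_names):
    """Overwrite '/Name Do' with spaces for every name in image_names.

    The replacement has exactly the same length as the matched text,
    so no byte offsets in the file change.
    """
    def repl(match):
        if match.group(1) in image_names:
            return b" " * len(match.group(0))
        return match.group(0)  # some other XObject: leave it untouched

    return DO_RE.sub(repl, pdf_bytes)
```

To use it, read uncompressed.pdf in binary mode, call e.g. `blank_image_draws(data, {b"Im0"})`, and write the result to edited.pdf before running the compress step.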

I have a script that does this to remove pen strokes of particular colours so I can e.g. strip out marking rubric on test solutions written on a tablet.

Get the PDF 1.7 spec from https://pdfa.org/resource/pdf-specification-archive/. You're looking for the "Do" operator invoking a named image object defined elsewhere with "/Subtype /Image". See section 4.8, particularly the example on p343. Or, if it's badly done, it might instead be an inline image using the "BI" operator (a bit later in the same section).
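
If you'd rather not hunt for the image names by eye, something like this can list candidates. Again just a sketch under assumptions: `find_image_names` is a made-up helper, it expects the uncompressed file with plain `N G obj` objects, and it is purely regex based, so it will also pick up other dictionary keys that point at image objects (e.g. `/SMask`) and it ignores inline `BI` images entirely.

```python
import re

# Dictionary part of each indirect object: "12 0 obj ... stream" (or endobj).
OBJ_RE = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\b(.*?)(?:\bstream\b|\bendobj\b)", re.DOTALL
)
# Dictionary entries that point at an indirect object: "/Im0 12 0 R".
REF_RE = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+(\d+)\s+(\d+)\s+R\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def find_image_names(pdf_bytes):
    """Return the names (as bytes) under which image objects are referenced."""
    # Object and generation numbers of objects declaring /Subtype /Image.
    image_objs = {
        (m.group(1), m.group(2))
        for m in OBJ_RE.finditer(pdf_bytes)
        if IMAGE_RE.search(m.group(3))
    }
    # Names whose indirect reference targets one of those objects.
    return {
        m.group(1)
        for m in REF_RE.finditer(pdf_bytes)
        if (m.group(2), m.group(3)) in image_objs
    }
```

The names it returns are the ones to look for in front of "Do" in the page content streams.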



