So one thing I do to counter point 3 is have multiple non-English characters map to the same English character, and I pick a random one each time. Depending on the input, there can be ~10 or so characters mapping to any single English letter. If you're advanced enough to know about a substitution cipher, you'll figure out how to convert to an image-based PDF and then run OCR on that. The reason I have the multiple mappings is so that if a layperson were trying to find all instances of "Billy", they could copy those characters and then search for "Ҽҙӈㅰベ", but the other instances of "Billy" might have the code points "Ҽтぃぃヴ".
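
A minimal sketch of the mapping in Python (the table and stand-in code points here are illustrative, not my actual ones; in the real PDF a custom font renders each stand-in as the normal Latin glyph):

    import random

    # Several stand-in code points per English letter (illustrative only;
    # a real table covers the whole alphabet with ~10 choices per letter).
    HOMOPHONES = {
        "B": ["Ҽ", "Ъ", "Ᏼ"],
        "i": ["ҙ", "т", "і"],
        "l": ["ӈ", "ぃ", "ḷ"],
        "y": ["ベ", "ヴ", "у"],
    }

    def obfuscate(text):
        # Pick a random stand-in each time, so two occurrences of the
        # same word end up with different code points and a search for
        # one copied instance won't match the others.
        return "".join(random.choice(HOMOPHONES.get(c, [c])) for c in text)

    print(obfuscate("Billy"))  # different code points every run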

Again, it's resistant to built-in PDF reader OCR, not bulletproof. I'm trying to thwart a crawler, a script kiddie, or a 50-year-old divorce attorney, not the denizens of HN.



That's a fair point. I was trying to point out an interesting relationship between your approach and various forms of cryptography, but obviously this would not be the first-line solution to the problem. I think the resistance to crawlers will come down to whether they implement OCR or not; I suspect some of the more sophisticated ones do. (A naive one might also be fooled by the fact that real glyphs are used and not attempt to OCR the text.)

BTW, you probably shouldn't say it's resistant to PDF reader OCR, because most PDF readers don't have OCR, AFAIK. They just pull the text from the document; that's not OCR. Software that does have OCR, like Adobe Acrobat, will not be fooled by your obfuscation if you render the document to bitmap or textless vector first. If OCR doesn't work on the document as is, it's only because the presence of text glyphs fools it into thinking there's nothing there to perform optical character recognition on.
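
You can see this for yourself. A sketch using PyMuPDF ("document.pdf" is a stand-in filename):

    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    for page in doc:
        # get_text() just decodes the code points behind the glyphs; no
        # OCR happens anywhere. On an obfuscated file this prints the
        # substituted characters, not the "Billy" a human sees rendered.
        print(page.get_text())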


I don't think you're right on that second point. I'm pretty sure they do OCR, but they're only looking for image data to mine the text out of. The way they're coded now, they think the document is already all text, so they can't find any images to convert. Again, this is not an impervious approach. There are ways around it; it's just that crawlers don't go down that rabbit hole (right now).


In an earlier comment you said: "PDFs are a clusterfuck of glyphs floating in space". On that point you are spot on. A textual PDF is in essence simply a series of instructions to position glyphs in space (where "space" is the 2D "sheet of virtual paper" that the glyphs render onto).
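
You can watch those instructions being written. A sketch with PyMuPDF (the coordinates and text are arbitrary):

    import fitz  # PyMuPDF

    doc = fitz.open()                    # new, empty PDF
    page = doc.new_page()
    page.insert_text((72, 72), "Hello")  # paint glyphs at x=72, y=72

    # The raw content stream is literally "pick a font, move to a point,
    # paint these glyphs there": operators like Tf, Td and Tj.
    print(page.read_contents().decode("latin-1"))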

But you are incorrect in asserting that PDF readers do OCR. Most do not, and even Acrobat did not have OCR capability built in for a good long time in its early history.

However, because the PDF file is simply instructions to position glyphs, a reader that maintains a map table of "where in space" (on the 2D sheet) it placed each glyph can implement select-and-copy by using the coordinates of the selection box to look up which glyphs were positioned in that space. Then, using either the reverse glyph map table (optional, but recommended to include) or, if that is missing, simply outputting the code point values that chose those glyphs, it gets the "text back out" without doing any OCR.
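
PyMuPDF exposes exactly that coordinate map, so you can imitate select-and-copy in a few lines (the filename and selection rectangle are made up):

    import fitz  # PyMuPDF

    page = fitz.open("document.pdf")[0]
    selection = fitz.Rect(72, 72, 300, 100)  # the user's "selection box"

    # Each word comes back with the rectangle it was painted into; keeping
    # the ones inside the selection is a pure coordinate lookup, no OCR.
    words = [w[4] for w in page.get_text("words")
             if fitz.Rect(w[:4]).intersects(selection)]
    print(" ".join(words))

Whether what comes out looks like "Billy" or like "Ҽҙӈㅰベ" then depends on that reverse map (the font's optional /ToUnicode table); if it's absent or scrambled, you just get the raw code points back.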



