So one thing I do to counter point 3 is have multiple non-English characters map to the same English character, and I pick a random one each time. Depending on the input, there can be ~10 or so characters mapping to any single English letter. If you're advanced enough to know about a substitution cipher, you'll figure out how to convert to an image-based PDF and then run OCR on that. The reason I have the multiple mappings is so that if a layperson were trying to find all instances of "Billy", they could copy those characters and then search for "Ҽҙӈㅰベ", but the other instances of "Billy" might have the code points "Ҽтぃぃヴ".
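
A minimal sketch of the mapping in Python (the table and stand-in code points here are illustrative, not my actual ones; in the real PDF a custom font renders each stand-in as the normal Latin glyph):

    import random

    # Several stand-in code points per English letter (illustrative only;
    # a real table covers the whole alphabet with ~10 choices per letter).
    HOMOPHONES = {
        "B": ["Ҽ", "Ъ", "Ᏼ"],
        "i": ["ҙ", "т", "і"],
        "l": ["ӈ", "ぃ", "ḷ"],
        "y": ["ベ", "ヴ", "у"],
    }

    def obfuscate(text):
        # Pick a random stand-in each time, so two occurrences of the
        # same word end up with different code points and a search for
        # one copied instance won't match the others.
        return "".join(random.choice(HOMOPHONES.get(c, [c])) for c in text)

    print(obfuscate("Billy"))  # different code points every run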

Again, it's resistant to built-in PDF reader OCR, not bulletproof. I'm trying to thwart a crawler, a script kiddie, or a 50-year-old divorce attorney, not the denizens of HN.



That's a fair point. I was trying to point out an interesting relationship between your approach and various forms of cryptography, but obviously this would not be the first-line solution to the problem. I think the resistance to crawlers will come down to whether they implement OCR or not; I suspect some of the more sophisticated ones do. (A naive one might also be fooled by the fact that real glyphs are used and not attempt to OCR the text.)

BTW, you probably shouldn't say it's resistant to PDF reader OCR, because most PDF readers don't have OCR, AFAIK. They just pull the text from the document; that's not OCR. Software that does have OCR, like Adobe Acrobat, will not be fooled by your obfuscation if you render the document to bitmap or textless vector first. If OCR doesn't work on the document as is, it's only because the presence of text glyphs fools it into thinking there's nothing there to perform optical character recognition on.
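
You can see this for yourself. A sketch using PyMuPDF ("document.pdf" is a stand-in filename):

    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    for page in doc:
        # get_text() just decodes the code points behind the glyphs; no
        # OCR happens anywhere. On an obfuscated file this prints the
        # substituted characters, not the "Billy" a human sees rendered.
        print(page.get_text())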


I don't think you're right on that second point. I'm pretty sure they do OCR, but they're only looking for image data to mine the text out of. The way they're coded now, they think the document is already all text, so they can't find any images to convert. Again, this is not an impervious approach. There are ways around it; it's just that crawlers don't go down that rabbit hole (right now).


In an earlier comment you said: "PDFs are a clusterfuck of glyphs floating in space". On that point you are spot on. A textual PDF is in essence simply a series of instructions to position glyphs in space (where "space" is the 2D "sheet of virtual paper" that the glyphs render onto).
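
You can watch those instructions being written. A sketch with PyMuPDF (the coordinates and text are arbitrary):

    import fitz  # PyMuPDF

    doc = fitz.open()                    # new, empty PDF
    page = doc.new_page()
    page.insert_text((72, 72), "Hello")  # paint glyphs at x=72, y=72

    # The raw content stream is literally "pick a font, move to a point,
    # paint these glyphs there": operators like Tf, Td and Tj.
    print(page.read_contents().decode("latin-1"))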

But you are incorrect in asserting that PDF readers do OCR. Most do not, and even Acrobat did not have OCR capability built in for a good long time in its early history.

However, because the PDF file is simply instructions to position glyphs, a reader that maintains a map table of "where in space" (on the 2D sheet) it placed each glyph can implement select-and-copy by using the coordinates of the selection box to look up which glyphs were positioned in that space. Then, using either the reverse glyph map table (optional, but recommended to include) or, if that is missing, simply outputting the code point values that chose those glyphs, it gets the "text back out" without doing any OCR.
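
PyMuPDF exposes exactly that coordinate map, so you can imitate select-and-copy in a few lines (the filename and selection rectangle are made up):

    import fitz  # PyMuPDF

    page = fitz.open("document.pdf")[0]
    selection = fitz.Rect(72, 72, 300, 100)  # the user's "selection box"

    # Each word comes back with the rectangle it was painted into; keeping
    # the ones inside the selection is a pure coordinate lookup, no OCR.
    words = [w[4] for w in page.get_text("words")
             if fitz.Rect(w[:4]).intersects(selection)]
    print(" ".join(words))

Whether what comes out looks like "Billy" or like "Ҽҙӈㅰベ" then depends on that reverse map (the font's optional /ToUnicode table); if it's absent or scrambled, you just get the raw code points back.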



