  The only way to learn how to recignize and exploit the defects in a lock is to practive. This means practiving many times on the saem lock as well as practiving on many different locks.
Not quite related, but is there a reason that practice is consistently misspelled? Perhaps some interesting lore?

It's been transcribed automatically to HTML from a PDF document which was typeset using LaTeX. The original document does not contain these errors:


The errors are an artifact of the lossy process which goes from the actual text content, to semantic LaTeX source, to PDF (designed for print reproduction, not content portability), and back to HTML. This last step might even be using OCR.

But the errors may be present even without invoking OCR - I often find that I can't copy text from a PDF generated by my professors' TeX toolchains because the various ligatures, kerning, and other subtle effects that TeX produces from letter to letter mangle the paste buffer. Also, while the default font (Computer Modern) looks fantastic and very professional when rendered correctly, and looks even better with TeX typesetting adjustments, many PDFs are generated with bitmap fonts and then rendered on systems which attempt to perform or remove anti-aliasing, DPI scaling, smoothing, and other effects. You can see some of this in the above document.
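To make the ligature point concrete, here is a minimal sketch, assuming the PDF viewer copied each ligature glyph out as a single Unicode ligature code point (e.g. U+FB01 for "fi"). In that case Unicode compatibility normalization folds them back into plain letters. The function name `expand_ligatures` is made up for illustration, and this does nothing for PDFs whose fonts lack a usable glyph-to-text mapping.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Fold Unicode ligature code points (fi, fl, ff, ffi, ...) back to plain letters.

    NFKC applies compatibility decompositions, so it also rewrites other
    compatibility characters (superscripts, full-width forms, etc.), not
    only ligatures.
    """
    return unicodedata.normalize("NFKC", text)


# "define" and "office" as they might come out of a paste buffer,
# with U+FB01 (fi) and U+FB03 (ffi) in place of the separate letters.
pasted = "de\ufb01ne o\ufb03ce"
print(expand_ligatures(pasted))
```

This only helps with the paste-buffer problem; it would not explain c turning into v.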

hm... although weird, it seems more likely to me that the errors were produced by humans retyping the document, since c doesn't look like v, but c is right next to v on the keyboard. Other examples of people misspelling "-ice" as "-ive": https://youtu.be/ZtIVWWpZRJQ?t=51 And the example in another comment of pick -> pikc is also more indicative of a human typo than of OCR. Dunno why they'd retype the document though...
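The keyboard argument can be checked mechanically. A rough sketch, assuming a US QWERTY layout and counting only same-row neighbors; the names `are_row_neighbors` and `classify_typo` are made up for illustration, and the word pairs are the misspellings quoted in this thread:

```python
# QWERTY letter rows (assumed layout; other layouts give different neighbors).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to its (row, column) position.
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def are_row_neighbors(a: str, b: str) -> bool:
    """True if a and b sit side by side on the same keyboard row."""
    if a not in KEY_POS or b not in KEY_POS:
        return False
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return r1 == r2 and abs(c1 - c2) == 1


def classify_typo(intended: str, typed: str) -> str:
    """Label a same-length misspelling as a typical keyboard slip or 'other'."""
    if len(intended) != len(typed):
        return "other"
    diffs = [i for i, (x, y) in enumerate(zip(intended, typed)) if x != y]
    if len(diffs) == 1:
        i = diffs[0]
        if are_row_neighbors(intended[i], typed[i]):
            return "adjacent-key substitution"
        return "other"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # Two consecutive letters swapped, e.g. "ck" typed as "kc".
        if intended[i] == typed[j] and intended[j] == typed[i]:
            return "transposition"
    return "other"


pairs = [("recognize", "recignize"), ("practice", "practive"),
         ("same", "saem"), ("pick", "pikc")]
for intended, typed in pairs:
    print(f"{intended} -> {typed}: {classify_typo(intended, typed)}")
```

All four quoted misspellings come out as either an adjacent-key substitution or a transposition, which fits retyping better than OCR. It's a heuristic on a handful of words, not proof.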

PDF is a terminal format, so this isn't really avoidable in all cases.

If you want HTML, you're usually much better off using LaTeX-to-HTML tooling.

If you have the source, it's obviously better to use `hlatex`.

But PDF is only intended to be a terminal format. In the real world, though, it's very common for the 'terminal format' - whether a binary executable or a PDF - to be the only format available.

It would be very useful if the toolchains used to produce PDFs - whether `latex`->`dvips`->`ps2pdf` or `pdflatex`, or any of the other possibilities in the extremely complicated TeX ecosystem - did a better job of maintaining the semantic and raw-text content of the source.

I would happily increase the size of all my PDFs by a couple percent if it meant I could better extract the contents in the future. I do realize that when you multiply this few percent by many gigabytes of PDFs on archive sites and across many uploads and downloads, it becomes more important, but I would assert that it increases the value of those PDFs by more than it costs.

I wonder if you could include the LaTeX source as an unreferenced stream in the PDF document. If there were a standard around this, we could have tools to easily convert compatible PDF documents into whatever format.

LibreOffice can do this, so I guess LaTeX would be able to do it too.

possibly OCR?

EDIT: Possibly from the photocopy of this LaTeX document?


  The big secret of lock picking is that it's easy. Anyone can learn how to pikc locks.
There seems to be a large number of spelling errors. Not sure if it's indicative of anything other than lack of editing.

The text is probably the output from an OCR program that was run on a printed copy of the text, and the user didn't scan the output for the inevitable errors that these programs produce.

It looks to be retyped (badly), not OCR'd. The PDF has none of these errors, which are not the sort made by OCR.
