> Copying the text gives: “ch a i r m a n ' s s tat em en t” Reconstructing the ...

hrktb · on Sept 14, 2020

It's doable, but only in the easiest cases. If it's a one-off script, or some automation whose output you hand to a human to review afterwards, it can be a good first step.

If it's supposed to be a somewhat final result, run against a dataset you have little control over (people sending you PDFs they made, vs. PDFs coming from a known automated generator), you'll hit all the other not-so-edge cases very fast.

For example: otherwise invisible characters inserted in the middle of your text, layout that makes no logical sense and puts the text in a weird order but looked fine when displayed on the page, characters missing because of weird typographic optimizations (ligatures, characters only present in specific embedded fonts, etc.). Basically everything in the article is pretty easy to find in the wild.
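To make two of those cases concrete, here is a minimal sketch of a cleanup pass over already-extracted text, using only Python's standard `unicodedata` module. The function name `clean_extracted_text` is made up for illustration and is not part of any PDF library. It only handles ligature code points and invisible format characters; it does nothing about reading order or glyphs that are genuinely missing from the extraction.

```python
import unicodedata


def clean_extracted_text(text: str) -> str:
    """Hypothetical helper: fold ligatures and drop invisible format characters."""
    # NFKC maps compatibility characters such as the "fi" ligature (U+FB01)
    # to their plain-letter equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    # Unicode category "Cf" covers format characters such as the zero-width
    # space (U+200B), the soft hyphen (U+00AD) and the BOM (U+FEFF).
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


sample = "of\ufb01ce\u200b sta\u00adtement"
print(clean_extracted_text(sample))  # office statement
```

One caveat: category Cf also includes the zero-width joiner and non-joiner, which carry meaning in some scripts, so dropping them blindly is only reasonable if you know the text is in a language where they don't matter.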

mcswell · on Sept 15, 2020

You're unlikely to find "chairman's" in a dictionary. Spell checkers (which at least for languages like English usually use table lookup) often succeed with common apostrophe-s nouns, but fail with less common ones, because even for English it's often considered impractical to store all possible possessive forms.
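As a rough illustration of the usual workaround, the sketch below tries the token as-is and then retries with a trailing apostrophe-s stripped, so the lexicon does not need to store possessive forms. The function name `is_known_word` and the tiny lexicon are invented for the example; real spell checkers do considerably more than this.

```python
def is_known_word(token: str, lexicon: set[str]) -> bool:
    """Hypothetical lookup: accept a word or its apostrophe-s possessive."""
    word = token.lower()
    if word in lexicon:
        return True
    # Retry without the possessive suffix; PDFs often contain the
    # typographic apostrophe (U+2019) rather than the ASCII one.
    for suffix in ("'s", "\u2019s"):
        if word.endswith(suffix) and word[: -len(suffix)] in lexicon:
            return True
    return False


lexicon = {"chairman", "statement"}  # toy word list, for illustration only
print(is_known_word("chairman's", lexicon))  # True
print(is_known_word("chairmans", lexicon))   # False
```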

Where dictionary lookup really becomes infeasible is in languages with a great deal of inflectional morphology, like Finnish. For languages like that, you need some kind of morphological analyzer. Hunspell provides a fairly simple one, while a number of finite state transducers (xfst, foma, sfst...) provide more sophisticated mechanisms to build a morphological parser with.

kzrdude · on Sept 14, 2020

But it might already be hard for a simple procedure to know if it should be chair man or chairman here.
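A small sketch shows why: if you strip the spaces and ask a word list for every way to split the letters back into words, both readings come out as valid. The function name `segmentations` and the three-word lexicon are assumptions made up for this example, and the naive recursion is exponential in the worst case, so it is only meant to show the ambiguity.

```python
def segmentations(letters: str, lexicon: set[str]) -> list[list[str]]:
    """Hypothetical helper: every way to split `letters` into lexicon words."""
    if not letters:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(letters) + 1):
        head = letters[:end]
        if head in lexicon:
            # Keep this prefix and recurse on whatever is left.
            for tail in segmentations(letters[end:], lexicon):
                results.append([head] + tail)
    return results


lexicon = {"chair", "man", "chairman"}  # toy word list, for illustration only
print(segmentations("chairman", lexicon))
# [['chair', 'man'], ['chairman']]
```

Picking between the candidates needs something beyond the dictionary, such as word frequencies or the original glyph positions on the page.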