> Copying the text gives: “ch a i r m a n ' s s tat em en t” Reconstructing the original text is a difficult problem to solve generally.
Why not look for stretches of characters with spaces between them, concatenate them, check the result against a dictionary and, if a match is found, remove the spaces?
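For what it's worth, here is a rough sketch of that idea in Python, just to make the heuristic concrete. The `rejoin` function, the `max_span` limit and the tiny `WORDS` set are all made up for illustration; a real run would load a proper word list, and this one conveniently contains the possessive form.

```python
# Hypothetical mini-dictionary; a real one would be loaded from a word list.
WORDS = {"chairman's", "statement"}

def rejoin(text, words, max_span=12):
    """Greedily merge runs of space-separated fragments that form a known word."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest run of tokens first, down to a run of two tokens.
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate.lower() in words:
                out.append(candidate)
                i = j
                break
        else:
            # No run starting here matched: keep the fragment as it is.
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(rejoin("ch a i r m a n ' s s tat em en t", WORDS))
# prints: chairman's statement
```

Note that the greedy longest match will also glue together fragments that were meant to be separate words whenever their concatenation happens to be in the dictionary ("a part" becoming "apart"), so the output still needs a review pass.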
It's doable, but it only covers the easiest cases. If it's a one-off script, or some automation whose output you hand to a human to review afterwards, it can be a good first step.
If it's supposed to be a somewhat final result, run against a dataset you have little control over (people sending you PDFs they made, vs. PDFs coming from a known automated generator), you'll hit all the other not-so-edge cases very fast.
Things like otherwise invisible characters inserted in the middle of your text, layout that makes no logical sense and puts the text in a weird order but looked fine when displayed on the page, or characters missing because of weird typographic optimizations (ligatures, characters that only exist in specific embedded fonts, etc.). Basically everything in the article is pretty easy to find in the wild.
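Two of those cases (ligature code points and invisible characters) can at least be reduced with plain Unicode normalization. A minimal sketch, assuming the extractor gives you proper Unicode in the first place; `clean_extracted` is a name invented for this example, and it does nothing for reading-order problems or glyphs that have no Unicode mapping at all.

```python
import unicodedata

def clean_extracted(text):
    """Fold compatibility characters and drop invisible format characters."""
    # NFKC expands compatibility characters, e.g. the single "fi" ligature
    # code point U+FB01 becomes the two letters "f" and "i".
    text = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers format characters such as the zero-width space
    # (U+200B) and the soft hyphen (U+00AD).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

print(clean_extracted("of\ufb01ce"))       # ligature expanded
print(clean_extracted("state\u200bment"))  # zero-width space removed
```

Dropping every `Cf` character is a blunt choice: it also removes joiners that matter in some scripts and in emoji sequences, so it is only reasonable if you know the text is something like plain English.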
You're unlikely to find "chairman's" in a dictionary. Spell checkers (which at least for languages like English usually use table lookup) often succeed with common apostrophe-s nouns, but fail with less common ones, because even for English it's often considered impractical to store all possible possessive forms.
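A cheap workaround for the English possessive case is to strip the apostrophe-s before the lookup instead of storing every possessive form. This is only a sketch: `known` and the one-word dictionary are hypothetical, and it ignores plural possessives and anything else that looks like real morphology.

```python
def known(word, words):
    """Dictionary lookup with a fallback that strips a trailing possessive 's."""
    w = word.lower()
    if w in words:
        return True
    # Accept both the straight apostrophe and the typographic one (U+2019).
    for suffix in ("'s", "\u2019s"):
        if w.endswith(suffix) and w[: -len(suffix)] in words:
            return True
    return False

words = {"chairman"}  # hypothetical dictionary without possessive forms
print(known("chairman's", words))
print(known("chairmans", words))
```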
Where dictionary lookup really becomes infeasible is in languages with a great deal of inflectional morphology, like Finnish. For languages like that, you need some kind of morphological analyzer. Hunspell provides a fairly simple one, while a number of finite state transducers (xfst, foma, sfst...) provide more sophisticated mechanisms to build a morphological parser with.
Why not look for stretches of characters with spaces between them, concatenate them, check the result against a dictionary and, if a match is found, remove the spaces?
> “On_April_7,_2013,_the_competent_authorities”
Same here.