To deal with bad PDFs, I split each PDF into single-page .png files with Ghostscript, then run ocrmypdf on those .png files, which outputs each page as a PDF with a text layer. With the --sidecar option it also writes a .txt file containing the recognized text. I then concatenate all the single-page PDFs and dump the .txt files into a database for better searching.
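
A rough Python sketch of that workflow, in case it helps. Only Ghostscript, ocrmypdf and --sidecar come from my description above; the rest is assumed for illustration: the function names, the 300 dpi resolution, the page-%04d naming, using Ghostscript's pdfwrite device for the concatenation step, and SQLite with a plain `pages` table as the database. It also assumes the `gs` and `ocrmypdf` executables are on the PATH (on Windows the Ghostscript binary usually has a different name).

```python
import sqlite3
import subprocess
from pathlib import Path


def split_cmd(pdf, out_dir, dpi=300):
    # Ghostscript: render every page of the PDF to its own PNG file.
    # The 300 dpi default and the page-%04d pattern are assumptions.
    pattern = Path(out_dir) / "page-%04d.png"
    return ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
            f"-r{dpi}", f"-sOutputFile={pattern}", str(pdf)]


def ocr_cmd(png):
    # ocrmypdf: PNG in, PDF with text layer out, plus a .txt sidecar.
    png = Path(png)
    return ["ocrmypdf", "--sidecar", str(png.with_suffix(".txt")),
            str(png), str(png.with_suffix(".pdf"))]


def merge_cmd(page_pdfs, out_pdf):
    # Concatenate the single-page PDFs. Ghostscript's pdfwrite device is
    # one way to do this; the original workflow does not name a tool.
    return ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-sOutputFile={out_pdf}", *[str(p) for p in page_pdfs]]


def store_text(db, doc, page_texts):
    # Hypothetical schema: one row per page so searches can point at a page.
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (doc TEXT, page INTEGER, body TEXT)"
    )
    db.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [(doc, number, text) for number, text in enumerate(page_texts, 1)],
    )
    db.commit()


def process(pdf, work_dir, out_pdf, db):
    # Run the whole pipeline for one PDF.
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(split_cmd(pdf, work), check=True)
    pngs = sorted(work.glob("page-*.png"))  # zero-padded names sort by page
    for png in pngs:
        subprocess.run(ocr_cmd(png), check=True)
    page_pdfs = [png.with_suffix(".pdf") for png in pngs]
    subprocess.run(merge_cmd(page_pdfs, out_pdf), check=True)
    texts = [png.with_suffix(".txt").read_text(encoding="utf-8")
             for png in pngs]
    store_text(db, Path(pdf).name, texts)
```

Calling `process("scan.pdf", "work", "scan-ocr.pdf", sqlite3.connect("pages.db"))` would run all four steps for one file (the file names here are placeholders). After that, something like `SELECT doc, page FROM pages WHERE body LIKE '%invoice%'` finds the pages that mention a word.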