then you at least need a PDF reader that implements that, and you have to be sure that whatever parsing you do cannot be exploited, while still giving a useful representation. This might be easier for ML, where you don't care about visual display, but a human generally doesn't want to read raw, unformatted text. And a surprising amount of stuff is probably needed for a halfway decent visual display.
I do view the documents we format after they have gone through the processing stage. They seem to be the same in most ways I would care about: diagrams are still present, etc. I don't know about PDFs that contain forms, as these are not that kind of document but closer to research documents.