Do you have any suggestions for Python libraries (other than what's mentioned in...

convivialdingo · on July 6, 2023

I've had good luck with python-docx for reading word documents (typically specifications). Tables are supported - but it's not obvious where the table comes from in the document and I had to come up with a hack way to read image captions.

PDF has been hit or miss, but pypdf has improved in the last couple of years. Depending on the document you'll sometimes get random spaces or nospacesatall.

saeedesmaili · on July 6, 2023

I tried python-docx with a bunch of docx files (downloaded from Google Docs). It returns empty strings for hyperlinks and I couldn't manage to fix this. So if there is a sentence like "This is an important link to another doc or url." and the "link" is a hyperlink, python-docx returns "This is an important to another doc or url."

icegreentea2 · on July 6, 2023

Heh, I got a bit into hacking on python-docx last year (the original author seems to be focusing on other things than python-docx now) - I have a fork/branch where I tried to more properly implement external hyperlink functionality (https://github.com/icegreentea/python-docx/pull/7)

I realize now staring at this, that I might have broken API a little. You can't do "text = paragraph.text" anymore, but you can do "text = ''.join([run.text for run in paragraph.runs])" instead.

If you're curious at all why it breaks, it's because in the OOXML spec paragraphs are made up of a ordered list of runs or hyperlinks (and hyperlinks can then contain additional runs). The master branch just implements paragraphs as ordered list of runs (and ignores all hyperlinks).

saeedesmaili · on July 6, 2023

This sounds amazing! Thanks for sharing it, I will try it to see if I can replace it with the main python-docx. For my use case it suffices to have full text of each paragraph (even if it includes a hyperlink) and heading but also be able to have each of them separated when needed.

icegreentea2 · on July 6, 2023

Actually, I just realized that I had provided a 'one-off' hack to a similarish situation here: https://github.com/python-openxml/python-docx/issues/1123#is...

Replace the `qn("w:ins")` in the example with `qn("w:hyperlink")` and that should hopefully work?

convivialdingo · on July 6, 2023

Hey, that's fantastic. I'll definitely check that out.