Yeah that was the interesting part to me, at least. Plus, it's Microsoft so hope...

_rs · 2024-12-14T06:11:29 1734156689

That was the first thing I checked, and it looks like they’re using some existing Python package to parse docx files. I wonder if they contributed to it or vetted it strongly.

disgruntledphd2 · 2024-12-14T08:26:15 1734164775

Wow, I dunno if that's good or bad, certainly it's not what I expected.

wis · 2024-12-14T11:51:24 1734177084

Looking at the code, it looks like they used existing Python packages to read and parse MS Office formats. That's not what I expected: since the repo is in Microsoft's org on GitHub, I expected them to use Microsoft's "official" libraries for parsing these formats, through the Component Object Model (COM).

They used Mammoth for docx (Word) [1][2], python-pptx for pptx (PowerPoint) [3][4], and pandas for XLSX (Excel) [5].

[1] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
[2] https://pypi.org/project/mammoth/
[3] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
[4] https://pypi.org/project/python-pptx/
[5] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...

jamwil · 2024-12-14T15:15:51 1734189351

COM requires you to interact with the files through the associated MS Office applications, whereas these libs parse the OOXML file format directly.
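
To make "parse the format directly" concrete: a .docx file is a ZIP package whose main text lives in an XML part named word/document.xml, so it can be read with nothing but the Python standard library and no Office install. The sketch below is only an illustration of that idea, not how Mammoth or markitdown do it internally. The helper names build_minimal_docx and extract_paragraphs are made up for this example, and the package it builds is deliberately stripped down (Word itself would not open it).

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml in OOXML packages.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_minimal_docx(paragraphs):
    """Hypothetical helper: build a stripped-down .docx-like ZIP in memory.

    Assumption for illustration only: it contains just word/document.xml,
    without the content types and relationship parts a real document has.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


def extract_paragraphs(docx_bytes):
    """Hypothetical helper: read paragraph text straight from the ZIP + XML."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in nested <w:t> elements.
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    data = build_minimal_docx(["Hello", "parsed without Word installed"])
    print(extract_paragraphs(data))
```

The COM route, by contrast, would drive a running Word instance, which ties you to Windows with Office installed. Reading the ZIP and XML directly works anywhere Python runs, which is presumably why a converter like this leans on pure-Python parsers.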

LordDragonfang · 2024-12-13T19:47:45 1734119265

...I did not catch that it was from Microsoft. I was wondering why a random markdown converter was so notable.