
Ask HN: Document Conversion from DocX and Other File Formats to a Specific XSD - realmunk
We are trying to convert a .docx, and later potentially other file formats, into a kind of standard XML. This XML will then be mapped through an XSLT to the XML of our choice (defined by an XSD).<p>For the conversion to be successful, we need to keep as much of the information in the document as possible. The most important parts are the structure, the content, tables, lists, and figures (images etc.).<p>We have realised that this job is complex, and that there are serious restrictions on what kinds of documents we can support.<p>As there are different standards, implementing a converter for each of them would be time-consuming.<p>Does anyone have tips on how to deal with document conversion to XML, or on how to proceed?
======
brudgers
My intuition is that an XSLT that can handle everything in a .docx document
has approximately the same complexity as the Word executable, because it must
correctly interpret the information in the .docx in order to correctly
transform it. Which is to say that the number of corner cases such an XSLT
has to handle is very large. It reflects the many developer-years that have
gone into Word.

Because a .docx is already a set of XML documents, and Microsoft provides
tools for manipulating them [1], my gut reaction would be to treat Microsoft's
format as the primary form of storage and use those tools rather than
something custom built for a different XML format: it avoids having to design
and build the software for translating .docx files, designing a custom XML
schema, and building tooling for that custom schema.
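
To make the "already a set of XML documents" point concrete, here is a minimal
sketch in Python using only the standard library. It assumes the main body
lives in the `word/document.xml` part of the zip package and reads paragraph
text from it. The `<doc>`/`<para>` output vocabulary and the helper names are
made up for illustration, and the sample file built here is a stripped-down
stand-in, not a package Word would open. Tables, lists, styles, and images are
ignored, which is exactly where the corner cases start.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def docx_paragraphs(docx_bytes):
    """Return the plain text of each paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iterfind(".//w:p", NS):
        # Paragraph text is split across runs (w:r), each holding w:t nodes.
        paragraphs.append("".join(t.text or "" for t in p.iterfind(".//w:t", NS)))
    return paragraphs


def to_intermediate_xml(paragraphs):
    """Wrap paragraphs in a hypothetical <doc><para> vocabulary."""
    doc = ET.Element("doc")
    for text in paragraphs:
        ET.SubElement(doc, "para").text = text
    return ET.tostring(doc, encoding="unicode")


def make_sample_docx():
    """Build a tiny stand-in package: a zip holding only word/document.xml."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_intermediate_xml(docx_paragraphs(make_sample_docx())))
```

Even in this toy case one visible sentence is split over two runs, so a
transform that maps elements one-to-one would already get the text wrong.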

If conversion is definitely necessary, I'd still expect performing it with
Microsoft's tooling to be more productive than hand-coding XSLT.

Good luck.

[1]: https://msdn.microsoft.com/en-us/library/office/dn467914.aspx

