

I made a native Python module for MS Word docx - nailer
http://github.com/mikemaccana/python-docx

======
nailer
Hi Hacker News,

I was recently wandering around StackOverflow and PyPI and wondering why the
only solutions to make MS Office documents from Python seemed to be based
around either COM automation, OpenOffice automation, or calling Java or .NET
libraries.

So I made this module, which reads and writes Microsoft Office Word 2007 docx
files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by
Microsoft.

They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008,
OpenOffice.org 2.2, and Apple iWork 08.

The docx module has the following features:

Making documents

• Paragraphs

• Bullets

• Numbered lists

• Multiple levels of headings

• Tables

Editing documents

Thanks to the lxml module, we can:

• Search and replace

• Extract plain text of document

• Add and delete items anywhere within the document

• Run XPath queries against particular locations in the document - useful for
retrieving data from user-completed templates.
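To illustrate the editing features above, here is a minimal sketch that uses only the standard library (zipfile and xml.etree.ElementTree) rather than the docx module's own functions, whose names are not shown in this post. It assumes the body text lives in word/document.xml as w:t runs inside w:p paragraphs. The helper names and the tiny sample package are hypothetical, and the sample is not a complete docx that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_document_xml(docx_bytes):
    """Parse word/document.xml out of a docx package held in memory."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return ET.fromstring(package.read("word/document.xml"))


def paragraphs_text(root):
    """Return one string per w:p paragraph, joining its w:t text runs."""
    result = []
    for para in root.iterfind(".//w:p", NS):
        result.append("".join(t.text or "" for t in para.iterfind(".//w:t", NS)))
    return result


def replace_text(root, old, new):
    """Replace text inside individual w:t runs (matches do not span runs)."""
    for t in root.iterfind(".//w:t", NS):
        if t.text and old in t.text:
            t.text = t.text.replace(old, new)


def make_sample_docx():
    """Hypothetical stand-in package containing only word/document.xml."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Name: </w:t></w:r><w:r><w:t>PLACEHOLDER</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("word/document.xml", xml)
    return buf.getvalue()


root = read_document_xml(make_sample_docx())
replace_text(root, "PLACEHOLDER", "Mike")
print(paragraphs_text(root))
```

The same iterfind path expressions are a small subset of XPath. lxml accepts full XPath, which is what makes the template-query use case practical.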

It's only a couple of hundred lines - lxml does most of the heavy work - but
it's incredibly simple to use. Check out example.py in the link.

Hope you find it useful.

Mike

~~~
marcinw
Thanks Mike for providing such a useful module!

Some things I plan to add to this module (if you don't plan to already):

Add support for:

* Pass font size, bold, italic, bold+italic and spacing to the paragraph method to modify the font

* Pass a table row header to table method

* Specify table column widths

* Change text alignment for table rows/headers, etc.

* Change text style in table cells

* Ability to change background color for every other table row

* Create method to insert a page/section break

* Create a method to add an image at a specific position in document page
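For the page break item, a rough sketch of what such a method might return, assuming a page break is written as a w:br element with w:type="page" inside a run. The helper names w and pagebreak_paragraph are hypothetical and not part of the module. Section breaks are not covered here.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def pagebreak_paragraph():
    """Hypothetical helper: a paragraph holding a single page-break run."""
    para = ET.Element(w("p"))
    run = ET.SubElement(para, w("r"))
    ET.SubElement(run, w("br"), {w("type"): "page"})
    return para


# Use the conventional "w" prefix when serializing.
ET.register_namespace("w", W_NS)
print(ET.tostring(pagebreak_paragraph(), encoding="unicode"))
```

The returned element would be appended to the document body like any other paragraph.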

On one other note:

I'm not 100% positive, but thumbnail.jpg/thumbnail.wmf may result in
inadvertent disclosure of sensitive data if using this to generate report
documents...
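A quick way to check for this, sketched with the standard library only: since a docx is a zip package, listing its members shows whether a thumbnail is present. The member name docProps/thumbnail.jpeg in the sample is an assumption for illustration, and the function only detects thumbnails; removing one cleanly would also mean updating the package's relationship entries.

```python
import io
import zipfile


def find_thumbnails(docx_bytes):
    """Return names of package members that look like embedded thumbnails."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return [name for name in package.namelist()
                if "thumbnail" in name.lower()]


# Hypothetical sample package; the member names are assumptions.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as package:
    package.writestr("word/document.xml", "<placeholder/>")
    package.writestr("docProps/thumbnail.jpeg", b"not a real image")

print(find_thumbnails(buf.getvalue()))
```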

~~~
nailer
Awesome! I welcome contributions. Could you send me something for page &
section breaks?

I'm already working on images and 100% nose coverage. I'm intending to do
document properties after that.

Re: zebra striping for tables, we already have that via the inbuilt styles.
But more options to control styles would be useful.

Coding style is:

* Functional

* Google style - <http://code.google.com/p/soc/wiki/PythonStyleGuide>

* Unit tests are handled with nose / coverage

------
dirkstoop
Awesome, thanks. Are you putting this in the public domain, or are you
licensing it under any particular license agreement?

I couldn't find anything about a license or copyright on your project's github
page.

~~~
nailer
Good point. I've licensed it under MIT (originally I thought github made me
pick a license, but it seems not), see the updated README.

------
bigsassy
Python also has a native Python module for reading and writing Excel files.
It's pretty simple to use and also doesn't require any COM nastiness.
<http://www.python-excel.org/>

------
elblanco
docx has been a real boon to people who write things like search engines and
want to use the text inside the files.

We've been using <http://b2xtranslator.sourceforge.net/blog/> to convert
legacy binary Office formats to Open XML, then using XSLT to transform the XML
into our preferred schema for text extraction.

------
benatkin
Thanks! I've had to deal with docx files from time to time. I'm going to try
this out.

Does anyone know of a good way to read xlsx files with Python? I think I tried
one library and it didn't quite work for this 20,000 row file, and I wound up
using OpenOffice to convert it to csv. If not, hopefully this can be used as a
starting point for developing a nice xlsx library.

~~~
brendano
I wrote the following python script to read xlsx. It's extremely basic, but
works very well for me.

<http://github.com/brendano/tsvutils/blob/master/xlsx2tsv>

I've never tried it on a 20,000 row file. I suspect it would work. A really
large file might need to switch to a streaming XML parser, but probably Excel
itself wouldn't handle that use case too well.
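For reference, a streaming version might look something like the sketch below. It is not the xlsx2tsv script. It assumes the sheet is stored at xl/worksheets/sheet1.xml and returns raw cell values only: shared-string indexes are not resolved and empty cells are not padded, so columns can shift on sparse rows. The function names and sample data are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ROW, CELL, VALUE = (f"{{{S_NS}}}{t}" for t in ("row", "c", "v"))


def iter_rows(xlsx_bytes, sheet_path="xl/worksheets/sheet1.xml"):
    """Yield each row as a list of raw cell values, one row at a time."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        with package.open(sheet_path) as sheet:
            # iterparse reads the file incrementally instead of loading it whole
            for _event, elem in ET.iterparse(sheet, events=("end",)):
                if elem.tag == ROW:
                    yield [c.findtext(VALUE, default="") for c in elem.iter(CELL)]
                    elem.clear()  # drop the row's cells so they can be freed


def make_sample_xlsx():
    """Hypothetical stand-in package containing only one worksheet part."""
    sheet = (
        f'<worksheet xmlns="{S_NS}"><sheetData>'
        '<row r="1"><c r="A1"><v>1</v></c><c r="B1"><v>2.5</v></c></row>'
        '<row r="2"><c r="A2"><v>3</v></c><c r="B2"><v>4</v></c></row>'
        "</sheetData></worksheet>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("xl/worksheets/sheet1.xml", sheet)
    return buf.getvalue()


for row in iter_rows(make_sample_xlsx()):
    print("\t".join(row))
```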

------
est
A quick hack if you need .doc rather than .docx format: generate .rtf and
rename it to .doc :)
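A minimal sketch of that hack, assuming a bare-bones RTF header with one font is enough for Word to open the file. The function names, font choice and file name are hypothetical, and only plain paragraphs are handled.

```python
import os
import tempfile


def rtf_escape(text):
    """Escape RTF control characters and write non-ASCII as \\uN? escapes."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF wants signed 16-bit UTF-16 code units; "?" is the fallback char.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                value = int.from_bytes(units[i:i + 2], "little", signed=True)
                out.append(f"\\u{value}?")
    return "".join(out)


def make_rtf(paragraphs):
    """Wrap escaped paragraphs in a minimal RTF document."""
    body = "".join(rtf_escape(p) + "\\par\n" for p in paragraphs)
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + body + "}"


rtf = make_rtf(["Hello, Hacker News", "Braces {like these} are escaped"])

# Save the RTF under a .doc name, which is the whole trick.
path = os.path.join(tempfile.mkdtemp(), "hello.doc")
with open(path, "w", encoding="ascii") as handle:
    handle.write(rtf)
print(path)
```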

PS, .eml and .mht are actually the same format.

~~~
stevenbedrick
An even worse hack, that unfortunately has worked really well for me in the
past: generate a simple HTML file (em, b, u, ul/ol, etc.), rename it to .doc,
and Word will open it without appearing to notice or comment. This saved my
butt once when I discovered mid-project that the Textile library I was using
couldn't generate RTF... I felt more than a little bit dirty, but it worked
without a hitch.
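Roughly, the hack amounts to the sketch below. The function name and file name are hypothetical, and whether Word opens the result without a warning likely depends on the Word version.

```python
import html
import os
import tempfile


def make_html_doc(title, paragraphs):
    """Build a small HTML page using only basic tags."""
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n{body}\n</body></html>\n"
    )


page = make_html_doc("Report", ["Tom & Jerry", "Second paragraph"])

# Save the HTML under a .doc name instead of .html.
path = os.path.join(tempfile.mkdtemp(), "report.doc")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(page)
print(path)
```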

------
brendano
Great, thanks for doing this!

A while ago, I wrote a little Python script to read from xlsx Excel files.
It's nice that these Microsoft Office XML files can be processed in pure
Python.

<http://github.com/brendano/tsvutils/blob/master/xlsx2tsv>

------
k_shehadeh
Just wanted to add my thanks to the list. This was one of the bigger holes in
document processing and now it's been filled for the most part.

------
pmorici
Is there something similar that will support .doc and/or .ppt files that
doesn't rely on COM or the OpenOffice UNO bridge?

~~~
cschneid
About the closest you'll get is the Apache POI library. It's mostly an Excel
97 lib, but it supports Word to some extent.

------
wenbert
Thanks for this. This will make other people's lives so much better. This is
good for everyone.

------
anon42389475
this is great. thanks.

