
Using Microsoft Word with Git - the_dripper
https://blog.martinfenner.org/2014/08/25/using-microsoft-word-with-git/
======
codingdave
My SaaS deals primarily with legal documents that for years had been
maintained with Word. The pain of emailing documents is real, but the comfort
level with how Word works is also real. Over the years, most organizations
have developed internal workflows to share and send documents around that
bypass the pains, and while they may not be perfect, they work.

The funny thing is that the document authors like these ways of working. It is
the tech people who don't. I've seen "Git for Word" proposed many times a year
for a while now. And all of the ideas are interesting, but none of them appeal
to my audience because they don't care about git's feature set. Nobody wants
to branch and merge. Nobody wants a straight version history. ("Nobody"
meaning nobody in my market, not nobody in the world.)

They want a storytelling experience. They want to know the why, not the what.
And the workflow tends to be unidirectional, not with collaborative changes
coming back together, but with expanding changes as each person adds their
ideas and makes changes for a specific instance of using a document. The
experience we build for them brings in pieces of version history, pieces of
comments, pieces of telling the story of why something was done, so people
down the line can have more context to decide whether to accept or reject the
changes.

It isn't that "Git for Word" is a bad idea - on the contrary, it would be
great if someone pulls it off. My point is that building something that
improves on Word isn't actually about the software, it is about the document
workflows. If you find groups who work like software devs do, where documents
receive small updates from a team, and bring all changes together for a final
product, there is probably a market. But when evaluating such ideas, there has
to be a reality check of whether the actual use of the documents truly matches
the use case for git.

~~~
atoav
As someone who is at the intersection of tech and arts one of the things I
like about using git in projects is that it is very clear what is the latest
official definitive final variant of a piece of data and you don't have to ask
anybody to get it.

When I worked as a VFX freelancer I was amazed at the number of hours (=money)
burned by marketing agencies who didn't manage to give me the definitive
variant for a simple list of things they wanted. In one instance they gave me
everything they had, including crude and unrecognisable filenames, hints about
things that I should ignore via telephone etc. I had to make sense of it and
compile a list which I sent them to approve. They ended up approving another
list (!) which they themselves sent me two weeks prior and they only managed
to correct this once I hinted at this.

Of course this is an example of how things should never be. This usually
involves somebody getting sick and some uninformed person taking over etc. But
what I learned on film sets is that you should choose the defaults of your
communication culture in such a way, that it works under the absolute worst
conditions (bad weather, hungry, stressed, confused, etc).

And I have seen so many organisations fail at precisely that. If you get ill,
someone else should be able to take over without heading to an oracle. This is
not a special function limited to a version control workflow, it is something
that has to do with clear communication.

Using git can sometimes help avoid the whole problem by making it obvious
which file is the latest and which is a variant of it, and the people using it
will have to communicate clearly as well (e.g. by writing good commit
messages, choosing the "right" commit sizes, naming things the right way etc).
So if you know how to use git, you just might value clear communications a
little bit more than the average person.

~~~
gumby
> As someone who is at the intersection of tech and arts one of the things I
> like about using git in projects is that it is very clear what is the latest
> official definitive final variant of a piece of data and you don't have to
> ask anybody to get it.

As git is a distributed system I think it’s _not at all_ clear what the
_definitive_ final variant might be, and that is a strength.

That can be handled externally to git via ad hoc convention, say by using a
system like gitlab or github and letting it declare one as “primary”, or by
having someone post to a mailing list (“Commit X on a repo you can reach at
URI Y is the official release”) both of which are common.

But in your example various people could mail you commits and not have any
consensus on which is authoritative.

~~~
ISL
Git Tags can substantially address this concern.

------
tomashubelbauer
I've mentioned this in a similar thread a few months back, but it looks like
it could be relevant here, too:

[https://github.com/TomasHubelbauer/modern-office-git-diff](https://github.com/TomasHubelbauer/modern-office-git-diff)

I've made this script which automatically extracts the Office file format
(which is a ZIP archive of XML documents) and versions the XML documents and
their extracted text contents alongside the binary Office file. This is done
using a Git hook and it seems to work pretty well. If you're in need of
versioning Office documents, this might be a good enough solution for you.

Edit: I should also address why I don't use the built-in Office versioning
feature. The reason is that I like to be able to view the diffs in Git. I
don't want to have to open Office just to see the changes. My solution offers
that. Because the extracted XML and text contents are tracked alongside the
original, each commit's diff has the binary change as well as a textual diff,
which in my experience is good enough to tell the gist of the changes. And
you're using the standard Git / text manipulation tools you would use with any
other diff.
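
For anyone who wants the idea without reading the PowerShell, here is a minimal Python sketch of the text-extraction step only. It is not the linked script. It assumes the main body lives in `word/document.xml` inside the archive and that visible text sits in `<w:t>` elements. `docx_to_text` and `make_sample_docx` are names invented for this example, and the sample archive is deliberately incomplete (Word would not open it).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_bytes: bytes) -> str:
    """Return one line of plain text per paragraph found in a .docx file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's visible text is usually split across several <w:t> runs.
        lines.append("".join(t.text or "" for t in paragraph.iter(W + "t")))
    return "\n".join(lines)


def make_sample_docx(paragraphs) -> bytes:
    """Build a tiny, incomplete .docx-like archive purely for demonstration."""
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{ns}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(docx_to_text(make_sample_docx(["First draft.", "Second paragraph."])))
```

Writing that string to a `.txt` file next to the document from a pre-commit hook gives Git something line-oriented to diff.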

~~~
rperez333
This looks very interesting. Do you think it can be applied to other kinds of
XML files? I'm interested in using git with a vfx software (The Foundry Nuke)
that writes XML projects, and it would be great to have some versioning system
for it.

I've tried using the git diff patience algorithm, but it didn't work well:
frequently, the diff wanted to remove every single line and add them all
back to the XML file.

~~~
tomashubelbauer
In your situation, I'd just whip together a quick PowerShell script like I
have here, but tailor it to the structure of your file format: traverse the
XML tree and have a few if-else statements which filter out noisy metadata you
don't need to see in the diff, if any, and save the resulting collected text
node contents as a text file alongside the XML files. Thanks to the Git hook,
each commit with changes to the XML will also have a corresponding TXT file, so
you can very easily view the changes in a skimmable way, unlike the potentially
really big and messy XML diff you'd have if you versioned only the original.
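
A rough Python equivalent of that traverse-and-filter idea, as a sketch only. The tag names in `NOISY_TAGS` are hypothetical placeholders, not taken from any real file format, and `skimmable_text` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names standing in for whatever noise a real format carries.
NOISY_TAGS = {"timestamp", "uuid", "window_state"}


def skimmable_text(xml_source: str) -> str:
    """Collect tag names and text node contents, skipping noisy subtrees."""
    lines = []

    def walk(element, depth):
        if element.tag in NOISY_TAGS:
            return  # drop the whole noisy subtree
        text = (element.text or "").strip()
        label = "  " * depth + element.tag
        lines.append(f"{label}: {text}" if text else label)
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_source), 0)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<project><uuid>123</uuid>"
        "<node><name>Blur1</name><size>4</size></node></project>"
    )
    print(skimmable_text(sample))
```

The indentation in the output keeps a hint of the tree structure, so a moved or renamed node still shows up as a small, readable change.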

~~~
rperez333
thank you guys for these ideas, both sound great (powershell script and
linter) and I'm confident I will get something working now!

------
unnah
On Windows you can just use TortoiseGIT, it can do diff and even merge by
calling Word's internal compare tools. I can attest that diff works fine
(differences show up as if you had used track changes within word), but I
haven't had occasion to try merging Word documents with TortoiseGIT yet. The
same functionality was already available in TortoiseSVN.

~~~
kats
Omg! Thank you!

------
bugmen0t
FWIW, if you're using LibreOffice Writer you can save your file as a flat odt
file (.fodt), which gives you a version-controllable format.

~~~
est31
Interesting, I didn't know about fodt! Only knew that godot engine had done
something similar (for git specifically).

I downloaded a docx document from the net, opened it in libre office, removed
a single word, saved it as fodt, removed a single word again, saved it as fodt
again, and the diff between the two fodt is gigantic.

Apparently there are lots of items like `<text:p text:style-name="P20">`
whose content didn't change, but their ID did. It didn't even only affect IDs
of content after the removed word, but content before as well.

The file has 19361 lines and the diff size is 1110 lines so there is some
level of locality, but note that a lot of those lines are just base64 data of
image content. The fodt is 1.5 times as large as the original file.

Try it yourself, this is the document:
[https://www.acquisition.gov/sites/default/files/manual/SOP_P...](https://www.acquisition.gov/sites/default/files/manual/SOP_PSC_Updated_Process_Final_March_19_2020.docx)

~~~
wizzwizz4
You have to save, close, re-open, save, close, re-open a few times before the
diffs stabilise – and even then it'll seemingly-arbitrarily rename all the
tags.

I recommend having a commit hook that (somewhat) pretty-prints and line-wraps
the XML – perhaps splitting on sentences too, so that adding a word doesn't
proliferate all down the page. I haven't tried this, though, so it might not
help. If you do, could you release the code?
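
As a starting point, here is a sketch of such a normaliser in Python. It has not been tried against real LibreOffice output. It assumes that whitespace changes inside text are harmless for the format, it only touches an element's leading text (not text that follows a child element), and `normalize_xml` is a name invented here. Note that ElementTree rewrites namespace prefixes unless they are registered first, and `ET.indent` needs Python 3.9 or later.

```python
import re
import xml.etree.ElementTree as ET

# Split after sentence-ending punctuation that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def normalize_xml(xml_source: str) -> str:
    """Indent elements and put each sentence of text on its own line."""
    root = ET.fromstring(xml_source)
    for element in root.iter():
        if element.text and element.text.strip():
            sentences = SENTENCE_END.split(element.text.strip())
            element.text = "\n".join(sentences)
    ET.indent(root)  # pretty-print the element structure (Python 3.9+)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(normalize_xml("<doc><p>One. Two words here! Three?</p><p>Solo</p></doc>"))
```

With one sentence per line, adding a word should only change the line holding that sentence instead of rewrapping everything after it.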

------
binbag
I don’t understand why this 6 year old article has been posted when current
Microsoft 365 versions of Word et al have built in version control and real
time collaboration.

~~~
anjc
Yep. They've had version control for at _least_ a decade, diff'ing also, by
way of Compare. I'm also not sure why people are fascinated with using git
here. It's weird seeing all of the complex solutions in this thread for a
problem which does not exist.

Edit: I meant 'fascinated with using git here in this context'.

~~~
thrownaway954
i don't know why you are getting down voted for speaking the truth. I don't
know a single person who this article would relate to in today's climate since
everyone i know is using the latest version of office or on google docs.

~~~
BeetleB
The Word compare feature is often a pain to use. This is a decent alternative.

~~~
rrrrrrrrrrrryan
The parent was just mentioning that "compare" has existed for a decade.

The latest Office products have proper real-time collaboration and change
tracking, a la Google Docs.

------
MarcScott
I really don't understand how Word remains so popular. It was created at a
time when few people had internet access, and was designed to produce printed
documents. It was the perfect tool to write newsletters, flyers, articles,
academic papers and manuscripts. The world has moved on though, and I fail to
see Word's relevance today, other than the sheer number of people that are
familiar with it.

Word is expensive, proprietary and the XML it generates is unfathomable. There
are so many better FOSS tools and systems that we could be using. If you're
collaborating on a document then markdown or LaTeX has you covered. You get
version control through git and multiple people can contribute. If you're
writing a book or article, then the graphic designers and typesetters are
going to make the design decisions, not the author, so why bother messing
around with fonts and colours and the infuriating placement of images and
tables.

I authored a kid's book on coding, and the process was a nightmare. I authored
in markdown, used pandoc to convert and then further edited in libreoffice, to
be able to send stuff through in docx format. Then revisions were sent back in
docx and I had to reverse the whole process, so I could maintain my plain-text
version of the book. Then the proofs were sent through as PDFs, which I then
had to markup for corrections. Many of the mistakes were due to the crappy way
Word places images. In the end I just bought a copy of Word, and submitted to
the way my publisher wanted me to work, which disrupted the authorial process.

It's time we ditched Word, in the same way we ditched VHS and DVD. It's an
outdated technology that remains dominant just because everyone uses it at
school, and then refuses to move on. If schools insisted that all homework was
submitted in something like markdown, we'd see a dramatic change in a very
short period of time. (BTW when I was teaching CS, my kids authored in
markdown and submitted on GitHub)

Right, rant over - but I've been talking about this for years
-[http://coding2learn.org/blog/2014/04/14/please-stop-sending-...](http://coding2learn.org/blog/2014/04/14/please-stop-sending-me-your-shitty-word-documents/)

~~~
viraptor
> If you're collaborating on a document then markdown or LaTeX has you
> covered.

These are not WYSIWYG solutions, which answers 99% of your question "why". When
people want to write a document they want to write things and have the things
appear on a page, possibly in different formatting. Injecting ideas like
source files, rendering pipeline, etc. will just result in confused people.

That's why online solutions like Google docs are popular. No special app,
things look like expected, you can collaborate, and few people actually need
any fancy features.

~~~
MarcScott
I think my argument is that WYSIWYG needs to die. For the vast majority of
people they want nothing more than:

> text

> image

> more text

> table

> more text

There are any number of applications that allow you to write markdown and view
the generated HTML in whatever formatting you want. Your recipient then gets
to choose their own fonts, colours etc, which from an accessibility point of
view, is much better.

Unless you're printing a hardcopy or creating a PDF, what is the point of
Word?

~~~
WoodenChair
> I think my argument is that WYSIWYG needs to die.

WYSIWYG is a big part of what made the GUI revolution so successful. The
computer for the rest of us, wouldn't be for the rest of us, if we had to
worry about Git and how to render our file format.

I've had the same frustrations dealing with publishers and Word templates as
you had. Your mistake is that you are conflating our experience writing a
technical book with the vast majority of users who are not writing technical
literature. A writing system for the masses should be as easy to use (for the
basics at least) as paper and pencil. Git and learning even a simple markup
language do not meet this standard.

~~~
beagle3
WYSIWYG for professional word processing is like training wheels - it lets you
start being productive on day one, but if you don't spend the effort to learn
how to work without them, they get in the way and make you slower -- although
you wouldn't know that unless you've seen someone who can do the job without
them.

I used Word for ~10 years, but not in the last ~20 or so years, after
I realized how much time and effort it cost me -- nearly missed an important
deadline because of a Word 2 vs Word 6 incompatibility that manifested in a
very inopportune moment.

It's been around for almost 30 years. I'm constantly receiving documents from
people who've used it for >25years. And there is _never_ use of styles,
_often_ spaces instead of tabs, many "new lines" instead of a page break, and
a host of other things like that. References are not dynamic (just typed out)
meaning that an item inserted in the middle of a list makes many of them
wrong.

The vast majority of people who have used it for decades use it mostly as a
smart typewriter, because the "pro" features like styles require a lot of
discipline and the "let's just press the bold button" is too easy and
enticing.

WYSIWYG needs to die whenever anything professional is needed.

------
bovermyer
I gave up using Word to write manuscripts when I switched to Markdown
documents in git.

In the last few months, though, I gave up on Markdown to switch to a more
robust format - LaTeX. Before I switched, I didn't know LaTeX at all, but I
knew from my reading that it had the features I needed.

~~~
ectoplasmaboiii
I don't know if you're already familiar, but pandoc perfectly bridges this gap
for me. You can write things in markdown, then convert it to LaTeX no problem with
pandoc. You can even make templates for it, and write mathmode in markdown.

It certainly makes for less _noisy_ source files in my opinion, and it also
means that you get to take advantage of the fact that, if you want to, you can
easily convert your markdown to HTML, with maths using something like mathjax.

This was a bit of a ramble, but I honestly can't say enough nice things about
pandoc.

~~~
bovermyer
I was converting Markdown to PDF with pandoc before I started writing straight
LaTeX.

It's worth noting here that I'm writing _layout_ in LaTeX also - like
controlling the number of columns, where breaks exist, etc.

------
rhn_mk1
Not long ago I read some article here on HN that the world is still waiting
for a git equivalent for documents. This seems like a good start.

Now we need a native diff viewer for structured files, where the changes are
presented with attribution either side by side, or alongside (like gitk, or
like gitlab diff viewer).

Then we need an editor that supports doing the gitty stuff natively, so that
the non-technical writer doesn't have to worry about creating repos and
committing the changes from the command line.

~~~
ben-morris
After a lot of trying to get Git and Word to play nice together, we ended up
building a collaboration tool to bring the power of Git (branching & merging)
to non-technical Word users.

Feedback welcome: [https://www.simuldocs.com/](https://www.simuldocs.com/)

~~~
dvfjsdhgfv
Congratulations, it seems really easy to use. Just one thing: I noticed
formatting changes (such as italics) are ignored. Is this deliberate?

~~~
ben-morris
Thanks! It is, it's quite hard to highlight exactly what changed regarding
formatting and we found our users mostly cared about content changes.

Having said that, adding an option to include formatting changes is on our
roadmap.

~~~
zwp
Can an attacker use a formatting change to alter the sense of a document? Eg.
make the text color the same as the background color or invisibly small?

~~~
ben-morris
Potentially this could be used to remove some content (or make it appear
removed) without it being highlighted. I doubt this is an issue in practice,
considering there's a full audit trail and collaborators are usually trusted,
but this is good feedback and we'll see if we can improve this.
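
For readers wondering what such a check could look like, here is a sketch in Python. It is not how SimulDocs works. It only inspects formatting set directly on a run in `word/document.xml` (hidden text via `w:vanish`, white `w:color`, very small `w:sz`) and ignores anything inherited from styles or the page background. `suspicious_runs` and the size threshold are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def suspicious_runs(document_xml: str, min_half_points: int = 8):
    """Yield (reason, text) for runs whose direct formatting may hide text."""
    root = ET.fromstring(document_xml)
    for run in root.iter(W + "r"):
        props = run.find(W + "rPr")
        if props is None:
            continue  # no direct formatting on this run
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        vanish = props.find(W + "vanish")
        if vanish is not None and vanish.get(W + "val", "true") not in ("false", "0"):
            yield "hidden", text
        color = props.find(W + "color")
        if color is not None and color.get(W + "val", "").upper() == "FFFFFF":
            yield "white text", text
        size = props.find(W + "sz")
        if size is not None:
            value = size.get(W + "val", "")
            # w:sz is expressed in half-points, so 8 means 4pt.
            if value.isdigit() and int(value) < min_half_points:
                yield "tiny font", text
```

A diff tool that ignores formatting could run something like this on both versions and flag any run that newly appears in the result.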

------
PaulHoule
You do know that a Word document is really a ZIP file? The text content is
inside an XML document that, in principle, GitHub would work on. All you have
to do is unzip the document, store the directory in GitHub and repack it for
Word to use.
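
A minimal sketch of that unzip/repack round trip in Python, using only the standard library. `unpack` and `repack` are names invented here, and whether Word accepts every repacked archive is an assumption that has not been verified.

```python
import zipfile
from pathlib import Path


def unpack(docx_path: Path, target_dir: Path) -> None:
    """Extract the archive into a directory that can be committed."""
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(target_dir)


def repack(source_dir: Path, docx_path: Path) -> None:
    """Zip the directory back up into a file Word can be pointed at."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sorting keeps the member order stable between runs.
        for file in sorted(source_dir.rglob("*")):
            if file.is_file():
                archive.write(file, file.relative_to(source_dir).as_posix())
```

Usage would be along the lines of `unpack(Path("report.docx"), Path("report_src"))` before committing and `repack(Path("report_src"), Path("report.docx"))` after checkout, where both paths are placeholders.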

~~~
acbart
Wait, wouldn't this actually be fairly simple to set up? I'll admit my Git
knowledge is a little shaky, but couldn't you set up a Hook that runs when you
commit a Docx/Pptx/etc. file to unzip it in memory first, and then another
Hook when you checkout to zip it back into the original structure? I guess
conflicts are the major issue... asking users to navigate the XML/binary
structure stored inside could be a mess. A GUI could help, but that would
invoke different issues.

ETA: Apparently right below this comment someone has already created this:
[https://news.ycombinator.com/item?id=24303611](https://news.ycombinator.com/item?id=24303611)

------
josteink
> since earlier this month Pandoc can read Word documents in docx format.

Given this line, I think it's fair to add (2014) to the title.

This is pretty old news by now :)

------
jacobmischka
Before Word integrated its own improved version tracking in more modern
versions, during my undergrad I participated in a research project to add
version tracking to Word documents by abusing its zip file format[1]. My
research partner created a plugin to manage the versions, and my main
contribution was a Java tool that attempted version merging[2].

It wasn't fleshed out or usable, but it was an interesting project. I was
impressed at how open the Word/Office format was, this was before Microsoft's
reemergence into openness and open source.

[1]:
[https://dl.acm.org/doi/10.1145/2723147.2723152](https://dl.acm.org/doi/10.1145/2723147.2723152)

[2]:
[https://github.com/jacobmischka/Vvord](https://github.com/jacobmischka/Vvord)

------
usrusr
Ages ago I wrote a little Word VBA that exported a plaintext copy to go along
with the .doc every time I hit save. Worked quite well for eyeballing the
changes in a diff. Obviously you don't get merge support for .doc but since
that was still running on SVN where workflows tend to be less merge-heavy (or
was it still CVS? I feel old..) and I was working solo anyways the human-
readable diff worked well enough.

------
jksmith
I'm using Fossil for my book. My book is about business systems simplicity so
it's a great fit. If I hadn't started using sqlite for a project, I would have
never even heard of Fossil. What a great, beautifully simple combination.
Don't add complexity unless the complexity is worth the dysfunction it
addresses.

------
Gaelan
It's currently badly broken—see the issues, someone points out what needs
fixing—but I have a tool that uses Word's built in track changes functionality
as git diff backend:
[https://github.com/Gaelan/WordGit](https://github.com/Gaelan/WordGit)

------
ivan_ah
Wow nice. I'm a big fan of git's `--word-diff` option for text edits. The
output is almost as good as `latexdiff` and so much faster.

Another useful trick is to pipe the ANSI-colored terminal output through `aha`
([https://github.com/theZiz/aha](https://github.com/theZiz/aha) or `brew
install aha`) which produces HTML output, e.g.

    
    
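    # `git wdiff` is not built in; it is assumed to be an alias for a colored word diff
    git diff --color=always --word-diff=color | aha > ~/Desktop/mydiff.html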
    

You can then send the file mydiff.html to collaborators by email or add to CI
build script.
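
To show what a word-level diff does under the hood, here is a small Python sketch built on the standard library's `difflib`. It imitates the `[-removed-]` / `{+added+}` markers of `git diff --word-diff` but is not git's implementation, and `word_diff` is a name invented here.

```python
import difflib


def word_diff(old: str, new: str) -> str:
    """Mark removed words as [-...-] and added words as {+...+}."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(old_words[a1:a2])
            continue
        if op in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[a1:a2]) + "-]")
        if op in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[b1:b2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(word_diff("the quick brown fox", "the slow brown fox jumps"))
```

Because the comparison is on words rather than lines, rewrapping a paragraph does not register as a change.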

------
bisrig
Back in the bad old days of version control (thinking of VSS here), I was
overall pretty satisfied with how the check-in/check-out mechanics worked for
Word docs and the like. In this case you have the benefit of the sequential
workflow, in fact enforced or hinted by the tool itself, while also getting
rid of the recurrent weakness of email-based document storage. There were
plenty of other things to dislike about VSS (like, pretty much the rest of
them) but it wasn't so bad for maintaining documents.

------
noyesno
I work with telco standards and the organizations that I follow use Word
documents. The way we keep a paper trail of all the changes to a new
standard’s draft is by separating the change proposals into their own
documents (using change marks against the latest agreed draft) and only
allowing a named editor to actually implement the agreed change proposals back
to the master document. The change proposal documents, together with the
meeting minutes create a perfect history of who proposed what changes and
when.

------
tmaly
I implemented something just like this but in CVS for legal documents back in
2007.

I am finally replacing it with a sharepoint solution. It's a headache to have
to maintain versions for non-technical people.

------
erichdongubler
Has anybody used SimulDocs[0], which sells itself as a "version control for
Microsoft Word documents"? I've been really curious if it's a decent solution
in this space, but I tend to keep myself away from Word docs in my life
recently.

[0]: [https://www.simuldocs.com/features/version-control-for-micro...](https://www.simuldocs.com/features/version-control-for-microsoft-word)

------
formercoder
On a related note - anyone have a good pdf comparer?

------
vivekkalyan
My solution is inspired by this blog post but creates a global attributes
file. I documented it here:

[https://www.vivekkalyan.com/using-git-for-word](https://www.vivekkalyan.com/using-git-for-word)

I tend to prefer markdown for most things, but find it hard to beat Word in
terms of simplicity of elegant designs for, say, resumes.

------
lovetocode
.docx is just an archive format. If I remember correctly the contents inside
the .docx archive are plain text. Can’t we just use version control inside of
there? We would have to of course figure out a way to have git unpack and pack
the archive each time.

~~~
tinus_hn
It’s a zip, that’s not the hard part.

Apart from attachments and metadata the actual document is some kind of xml
monstrosity that contains the text and the markup. It’s not very useful to
just create diffs from that, it looks a bit like the HTML created by FrontPage
if you remember that.

You can just rename a docx file to .zip, unpack it and peek around.

~~~
mehrdadn
The XML might be awful for viewing but I do wonder if it would diff better for
storage? Git is awfully inefficient for storing binary data.

~~~
achr2
Not really, as there isn't a linearity or markup feel to the XML. Outside of
straight text changes, things like formatting, rearranging, and internal
markups are not possible to 'visually' diff in the XML.

------
Apofis
At first, I'm like... but it's just a zipped archive of XML and other content
files which can be used with git successfully, but yeah there's a mess in
there. It's not really meant to be human readable.

------
greenie_beans
yes! i have a collection of tweets where fiction/non-fiction writers joke
about naming their versions different things. i'm like, use git?

i wrote a novella using a folder system + text editor + git. i'm trying to put
that into a web app. don't know how useful it would be for other people
though. and don't know if it will ever be finished because i need to write.

~~~
rat_1234
I wrote a CLI tool in Python for this exact use case and am currently using it
to write a novel. Basically the CLI tool solves a lot of the tedious issues
that come up (e.g., combining all of the text files, reordering them, etc.)

If you like writing out of a text editor (I use Atom) it's super useful.

[https://github.com/edelgm6/draft](https://github.com/edelgm6/draft)

~~~
greenie_beans
cool, will check it out. thanks

------
chromedev
Just use Markdown (or similar markup language) and a tool like pandoc to
convert to word if necessary.

------
winrid
This is also covered in the book Pro Git, if you have the patience to read 400
pages about Git.

