
I am not the original poster, but I also worked on office file formats -- specifically I was one of the poor saps who worked on file import and export for Word Perfect after it was acquired by Corel. Before you send me hate mail, in my defence the code was mostly written before I got to it, and I was merely fixing the innumerable bugs in it.

I'm mostly familiar with the Word file format, so I will restrict my comments to that. It's been more than 15 years since I did this stuff, so my memory is hazy -- specifically I can't remember how the Excel file formats work at all.

Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn't know any better.

Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy work-arounds for things that would be simple if you allowed yourself to redesign the file format. It's pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.
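To make the "dump of memory" point and the backward-compatibility pain a bit more concrete, here is a minimal sketch in Python. The record layouts and field names are invented for illustration and are not Word's real structures; the only point is that once the on-disk bytes are the in-memory bytes, every later reader has to keep recognising the old layout.

```python
import struct

# Hypothetical layouts, invented for illustration; not Word's real structures.
V1 = struct.Struct("<IH")    # uint32 char_count, uint16 font_id
V2 = struct.Struct("<IHB")   # same prefix, plus a uint8 flags field added later


def dump_v1(char_count, font_id):
    """'Saving' is just writing the in-memory record out byte for byte."""
    return V1.pack(char_count, font_id)


def load_v2(data):
    """A newer reader must still accept the old, shorter dump."""
    if len(data) == V1.size:
        char_count, font_id = V1.unpack(data)
        return char_count, font_id, 0   # default for the field old files lack
    return V2.unpack(data)


old_file = dump_v1(120, 7)
record = load_v2(old_file)   # old dump read by the newer code path
```

Every new field means another branch like the one in `load_v2`, which is roughly the kind of work-around pile-up described above.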

Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.
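As a rough illustration of the append-only idea (and only the idea: this is a toy model with a made-up record shape, not the mechanism Word actually uses), a fast save can be thought of as adding an edit record to the end of the file and replaying those records on load:

```python
def fast_save(saved, edit):
    """Append one edit record; what was already saved is never rewritten."""
    return saved + [edit]


def load(saved):
    """Rebuild the current text by replaying the edits over the original text."""
    text = saved[0]
    for pos, delete_count, inserted in saved[1:]:
        text = text[:pos] + inserted + text[pos + delete_count:]
    return text


doc = ["Hello world"]                    # full save: the original text
doc = fast_save(doc, (6, 5, "there"))    # replace "world" with "there"
doc = fast_save(doc, (0, 0, "Oh. "))     # insert at the very start
current = load(doc)
```

Saving is cheap because nothing old is touched, but the reader now has to understand both the original data and every kind of appended record, which is where the extra structural complexity comes from.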

One thing I feel I must point out (I remember posting a huge thing on Slashdot when this article was originally posted) is that two-way file conversion is next to impossible for word processors. That's because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless a page break is explicitly entered by the user). It relies on the formatter to do it. Each word processor formats text completely differently. Word, for example, famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.

Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word -- only the file format doesn't tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.

I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like "Format like Word 95". The format doesn't say what that is -- because it's pretty obvious that the authors of the file format don't know. It's lost in a hopeless mess of legacy code and nobody can figure out what it does now.

For programmers who have worked on long lived legacy systems before, none of this should be a surprise. People think Microsoft purposely obfuscated their stuff, but when I worked at Corel, Microsoft used to call us up to tell us when we had broken our Word export filter. At least by that point, having Word as a standard file format was a plus for them. However, whenever we asked them what we should do to fix the filter, they invariably didn't know -- we knew more than they did.

> Word, for example, famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.

That's actually really interesting. Got any more details on that?

It's been ages and someone told me that they are allowing Word's pagination now as long as you explicitly say that's what you are doing (or something like that). But basically, pagination is incredibly important in legal documents. References are to specific pages and because footnotes can often span multiple pages, it is important that you render it correctly (or else the reference will point to the wrong page).

There is a specification for how to paginate legal documents in the US, but I don't remember where to get it. IIRC it is based on the Chicago Manual of Style. The other place you can find the correct explanation of pagination is in the TeX source code, because it does it correctly.

My memory of what Word does wrong is really fuzzy, but I think it has to do with footnotes that are longer than one page.

With every version, the DOJ would order thousands of upgrades of Word Perfect at full price. In exchange for that, we would pretty much fix any bug they wanted. We even wrote a printer driver for them once. If you look at the year-end reports for Corel around the year 2000, you will see the office suite numbers broken out. Those sales are mostly to legal document users.

My understanding as to why Word Perfect continues to have a major portion of the legal market is the huge installed base of template documents.

Want your lawyer to prepare a trademark infringement letter? He's going to charge you several hundred dollars for basically filling in the fields on a template he created a dozen years ago. And they aren't going to do anything that threatens that goldmine, like switch word processors.

WordPerfect is no longer used by any major law firms. In 10 years of practice at one of the world's largest law firms, I've never seen a WordPerfect document.

Thanks, that was excellent reading.

Looking at your comment on how the file format is only declarative and sort of allows the implementer to decide what to do - this is similar to what HTML does. But HTML, being an open spec, allowed implementers to deviate in some ways while still expecting a minimum level of compatibility, so its progress didn't suffer as badly.

But the Word file format was not even an open spec, so really, when people sometimes jump to Microsoft's defense, they are not even understanding how technology innovation actually suffers from closed standards which become dominant.

On the other hand, Microsoft itself could have benefited from having an open standard of the file format around when it wanted to write similar exporters. Why didn't it push for it, given the enormous power it had? Surely they put their business interests over what was the best thing to do in that situation from the engineering perspective?

I say all this because the more you hear from the insiders of that era, the more Joel's article is starting to look quite lopsided and biased, specifically this statement - "At no point in history did a programmer ever not do the right thing, but there you have it." It is nice to be able to fill out some of the missing details.


Edit: OK, are you also saying that once the initial decision was made to have a binary dump of all the data structures, there was no possibility of improving the format?

If so, does it mean that every application which had its own file format and took a similar approach as Microsoft's during that time (given that you say the original developers didn't know any better), was similarly stuck? How did Word Perfect design its file format?

Without throwing away backwards compatibility, it's pretty hard to improve the format. They could have easily done that, of course, but it was clearly not on the cards, politically. As a programmer, I think that's a bad decision, but it might have been the right decision for the user. It's hard to say.

Word Perfect actually had a vaguely non-sucky file format. When they "handed over"[1] development from the original WP team I had the opportunity to chat with them about it. They were very, very proud of having designed an actual file format rather than just dumping crap onto disk.

Essentially, it was a stream of tokens, each token representing a command. Again, it's been ages, so forgive me if I get this wrong, but you would have a "bold on" token, followed by some text, followed by a "bold off" token. For more complicated tokens I seem to remember they would embed the options in the token.
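A toy version of that kind of stream, sketched in Python. The token names here are made up for illustration and are not WordPerfect's real codes; the point is just that the document is a flat sequence of text and command tokens that a renderer walks in order.

```python
# Hypothetical token names, invented for illustration; not WordPerfect's real codes.
BOLD_ON, BOLD_OFF = "[Bold On]", "[Bold Off]"

stream = ["This is ", BOLD_ON, "important", BOLD_OFF, " text."]


def to_html(tokens):
    """Walk the stream in order, turning command tokens into markup."""
    out = []
    for tok in tokens:
        if tok == BOLD_ON:
            out.append("<b>")
        elif tok == BOLD_OFF:
            out.append("</b>")
        else:
            out.append(tok)      # anything else is plain text
    return "".join(out)


html = to_html(stream)
```

A "reveal codes" style view would amount to printing `stream` more or less as-is.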

If you have ever used WP, you may have seen the "reveal codes" feature (which was WP's "killer feature"). This was pretty much the actual representation of the stream. It allowed the user to see exactly how the formatting codes were laid out so that you could fix problems (instead of trying to guess why you have some bizarre vertical space in your document as you often have to do with Word). The main problem with WP's file format was that it was very, very easy to generate illegal streams or to mismatch tokens (especially to get the options wrong between the start and close tokens). I always tell people never to cut and paste in WP because it will corrupt the document eventually (a little more technical background: cut and paste was implemented by going through the RTF filter and back, which could easily corrupt the tokens on the way through).
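To show what a mismatched stream looks like in practice, here is a small checker sketched in Python. The token names and the pairing table are assumptions made up for this example (not WP's real codes), and it only counts start/close pairs; it does not try to model the embedded options mentioned above.

```python
from collections import Counter

# Hypothetical paired tokens, invented for illustration.
PAIRS = {"[Bold On]": "[Bold Off]", "[Italic On]": "[Italic Off]"}
CLOSE_TO_OPEN = {close: open_ for open_, close in PAIRS.items()}


def find_mismatches(tokens):
    """Report close tokens with no open before them, and opens never closed."""
    open_counts = Counter()
    problems = []
    for index, tok in enumerate(tokens):
        if tok in PAIRS:
            open_counts[tok] += 1
        elif tok in CLOSE_TO_OPEN:
            opener = CLOSE_TO_OPEN[tok]
            if open_counts[opener] == 0:
                problems.append(f"{tok} at {index} has no matching {opener}")
            else:
                open_counts[opener] -= 1
    for opener, count in open_counts.items():
        if count > 0:
            problems.append(f"{opener} never closed ({count}x)")
    return problems


# A paste that dropped the start token leaves a dangling close behind.
damaged = ["important", "[Bold Off]", " text."]
report = find_mismatches(damaged)
```

Anything that moves part of a stream around (like a round trip through another format) can leave it in a state such a check would flag.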

[1] Cautionary tale: Word Perfect Corporation was a generous company. When Word Perfect became very, very popular, they rewarded the original programmers handsomely. However, instead of giving the programmers equity, they gave them gigantic raises. By the time Corel acquired WP, the original programmers were on salaries that would make even high-flying Valley programmers blush. After sitting out the time period that they were contractually obliged to, Corel replaced those people with cheaper talent (roughly 1/10th the cost!). That included me. I always felt horrible about the situation, but fervently hoped that the programmers managed to save some of their ridiculous previous salaries.

Thanks for all your insights :-) Even though you feel a little conflicted about it, I would like to think all of us here at HN are very glad to know about your story.

Miklos Vajna hit a problem when he implemented complex drawing shapes in LO 4.3: he found a document that displayed a green triangle in Office 2010 but a red triangle in Office 2007. I believe this was under OOXML Strict...

This was a better read than the article. Touching on the same points, but briefer and from the perspective of someone who actually worked on it. Thanks!

Well, if it is a dump of memory, then what is a 'fast save' feature?
