

Why are the Microsoft Office file formats so complicated? (And some workarounds) - unfoldedorigami
http://www.joelonsoftware.com/items/2008/02/19.html

======
WenomousVit
Joel starts out by telling you that Microsoft's file formats are _not_ the
product of a demented Borg mind and are _not_ impossible to read or create
correctly.

And then he spends the rest of the article explaining _why_ the file formats
are so demented and impossible to read or create correctly.

Which isn't the same thing at all. Just because there is a rich history of it-
seemed-like-a-good-idea-at-the-time decisions doesn't mean the end result is
any good.

There is a serious, deep, and interesting problem of scaling and complexity
management that could be discussed here. But Microsoft's approach seems to
have been one of _embracing_ complexity. And Joel's role, today, is just
defending that approach.

------
projectileboy
I call bullshit. A lot of his arguments _perhaps_ make sense for early
versions of Office, but for '97 and later? It didn't occur to _anyone_ on the
Word or Excel teams _in 1997_ that _maybe_ interoperability would be an issue?

I really enjoy Joel's blog, but sometimes I can't stand the clubbish-ness of
the old-school Microsoft brigade. "We were a bunch of geniuses, doing the best
we could..." Yeah, well, the software is usable, but it kinda sucks. Don't be
surprised if the people MS victimized for a decade sound critical now that we
can all see how the sausage gets made.

~~~
spolsky
Sure, they realized interoperability would be important. But by 1995 customers
were already whining about how often the file format changed. Nobody would buy
Office (n+1) if it would start writing documents in a format that Office (n)
didn't understand, because interoperability means precisely that you can't get
everyone to upgrade at the same time. So they did a remarkably good job of
holding the file formats stable and backwards and forwards compatible for
about ten years. Look in the Excel spec and you'll see how much work when into
allowing old verisons of Excel to preserve and roundtrip future features they
don't understand.

~~~
mojuba
Ok, then take a look at how HTML does forward compatibility, for example.
Extremely simple, elegant. Not that i defend lexable-parsable formats - I got
the point that speed was important on stone-age computers. No. It's about
architecture.

Here is what I think: nobody at Microsoft ever cared about making anything
simpler than what they got after the first approximation, because complexity
looks impressive and hence sells better. On the other hand, making things
simpler requires more intellectual efforts and usually doesn't sell well,
especially in the consumer product business.

------
codesurgeon
What is the motivation for Joel Spolsky's post? Sympathy for MS Office file
formats? Reads like a manifesto for never ever changing/cleaning up the file
formats because it took a thousand man-years to come up with the current
incarnation. He is even going so far as to defend the office file formats'
complexity and resulting maintainability hell by saying that portability was
not an issue fifteen+ years ago - well, it is now. His suggestions for file
format conversion solutions are OK, but doesn't it occur to him, that
developers don't want to be locked-in to having to shell out money for MS APIs
forever? Ties in well with Bruce Schneier's latest post on vendor lock-in
<http://www.schneier.com/blog/archives/2008/02/lockin.html>

------
tlrobinson
Normally I agree Joel's articles, but having dealt with the new garbage that
Microsoft calls a standard, Office Open XML, I can tell you that none of this
applies. Yet the Open XML spec is over _6000_ pages long (compared to these
binary specs which are a measly 100-300 pages each).

Open XML is not designed for performance... it's XML, and today's computers
are fast enough. It IS [ _supposed to be_ ] designed for interoperability
(somehow they managed to get ECMA to put their stamp of approval on it), but
in reality it feels like it's a half-assed attempt to wrap all of Office's
legacy formats in XML. For example, to import you _still_ need WMF importing,
because a lot of the graphics (including all clip art) are WMF.

------
snorkel
Joel suggests using one licensed copy of Office to run as a web service
backend but I'm not so sure Microsoft's EULA allows that.

~~~
tlrobinson
Does anyone know for sure if this is true or not?

------
henning
"The idea of things like SGML and HTML—interchangeable, standardized file
formats—didn’t really take hold until the Internet made it practical to
interchange documents in the first place; this was a decade later than the
Office binary formats were first invented."

Put another way, Microsoft didn't care about interop until they were convicted
of illegal monopolistic practices and they couldn't get away with the kind of
shit reflected in this spec anymore.

------
brlewis
MSFT is releasing specs only after the EU is basically forcing them to, and in
the year 2008. Please don't try to tell me their formats are not deliberately
obfuscated.

If a format's goal truly was to maintain forward/backward compatibility while
remaining friendly toward low-end hardware, then it wouldn't be so hard to
reverse engineer.

------
jdueck
You can't blame Microsoft for awful Office file formats. Remember, they wrote
this stuff before the web came along, and before they generally started
sucking really bad. It's just the accumulation of decades of feature creep,
add-ons, re-dos, compatibility hacks, bug fixes, and workarounds.

Eventually, all software needs a rewrite. Not just MS's.

------
phil
One option Joel doesn't mention is to use OpenOffice. It has its own object
model (called UNO), which has bindings in many languages.

It's not a perfect implementation of the formats, but it's good enough for
most things. And has the rather large advantage that you can run it under your
favorite *nix.

------
BrandonM
It seems to me like Joel has a rather warped view of how software should be
built. He claims

 _It means you have to rewrite all of your date display and parsing code to
handle both epochs. That would take several days to implement, I think._

That's just ridiculous. In my mind, you need a piece of code that reads the
1904 record and sets a flag in the code. Then, your date display and parsing
code should all call one, or maybe two, functions which handle the conversion
for you based on this flag. Thus, supporting two different epochs requires at
most three components: one to read it and set a flag, one to convert a
numerical argument to a date based on that flag, and one to convert a date
back to a numerical argument (again, based on the flag). Should this really
take "several days to implement"? It seems to me that an hour should be plenty
of time.

It's going to be hard for me to continue to take him seriously if this is his
view of how software should be built.

~~~
BrandonM
I'm fine with being downmodded, but I'm curious why. I know that Joel has a
lot of articles submitted here and that me questioning his insights is
probably not a good idea (in terms of karma, at least), but does someone
actually think that implementing code to handle different epochs should take
several days?

------
edw519
"If you simply have to produce tabular data for use in Excel, consider CSV."

That's what most of us have been doing for years, simply to avoid the mess
discussed in this article.

After reading the article, I, for one, will keep on doing it.

~~~
ghiotion
I ran into a nasty bug(?) recently in the way Excel processes CSVs. I create a
well formed CSV from a Java Web app which the user downloads. They open the
CSV up in Excel and modify the data for an eventual upload. However, during
the save process, Excel truncates trailing commas. Which causes my well formed
CSV to be a poorly formed CSV during upload and parsing. Not fun.

~~~
malkia
For such type of thing you should always have something at the last column,
even return might be enough

A,B,,,,,,,," " C,D,,,,,,,," "

That's newline there between " and "

I guess Excel loads that internally through a convertor, trims the rightmost
unused columns, and then saves it that way.

