
Why are the Microsoft Office file formats so complicated? (2008) - diziet
http://www.joelonsoftware.com/items/2008/02/19.html
======
thr0waway1239
I remember Joel himself mentioning in some article about how Excel suddenly
became dominant over a competitor (Lotus spreadsheets I think) because you
could not only open Lotus files in Excel, but you could also save the Excel
back into Lotus format (two way compatibility). I am supposing this is because
Lotus file format was not complete crap.

If that is true, then isn't it also very possible that MS had an incentive to
keep building on top of its cruddy file format without rethinking the design
at any point? Isn't it true that even today, there is no viable competitor to
MS Office even on non-Windows OS because MS has made file format compatibility
extremely difficult?

I am very glad that the latest and greatest features of Office automation
after (and including) Office 2007 were mostly still-born and we could stop
worrying about this issue and use MS Office as a tool and not as a "platform".
It became too unwieldy even for hardcore MS supporters, and even though no
good alternatives emerged, people generally decided that the programmability
of MS Office had reached its limits - and moved away to other things like
replacing them with browser-based apps [1]. I remember creating an InfoPath
"app" as a contractor around 2010, and wondering "what is this crap? Have
these people never seen browser based apps?"

[1] This is probably a sweeping statement. I would love to hear from someone
who is certain that Office automation was a superior technology for a
particular use case.

~~~
jcoffland
> Isn't it true that even today, there is no viable competitor to MS Office
> even on non-Windows OS

I'm not sure what you mean by a viable competitor but LibreOffice works very
well for my needs and it runs on Windows, OSX and Linux. It also does a
sufficient job of reading and writing MS file formats. For that matter, Google
Docs is a pretty good competitor as well.

I think a lot of people have come to accept MS dominance as a given and then
put their heads in the sand. It's quite easy to get along just fine in
business these days without Microsoft or Apple for that matter.

~~~
sz4kerto
Try opening a large CSV file in Excel, GDocs, LibreOffice. GDocs dies around
50k rows, LO around a few 100k, Excel can cope with a million easily.

And yes, this is normal use, I don't want to use a database for doing a job
that's 2 mins in Excel. We use GDocs at the company, but it's a toy (a nice
toy though).

~~~
cel1ne
Try creating a UTF8 CSV-file that will open correctly in Excel.

Hint: It's not possible. My favourite chart about excel insufficiencies:

    
    
        Encoding  BOM      Win                            Mac
        --------  ---      ----------------------------   ------------
        utf-8              scrambled                      scrambled
        utf-8     BOM      WORKS                          scrambled
        utf-16             file not recognized            file not recognized
        utf-16    BOM      file not recognized            Chinese gibberish
        utf-16LE           file not recognized            file not recognized
        utf-16LE  BOM      characters OK,                 same as Win
                           row data all in first field
    

([https://stackoverflow.com/questions/6588068/which-encoding-o...](https://stackoverflow.com/questions/6588068/which-encoding-opens-csv-files-correctly-with-excel-on-both-mac-and-windows))

~~~
glaberficken
I found this unbelievable when I first came across it a few years ago.

More frustrating is when you need to produce a UTF-8 CSV from Excel. I always
used OpenOffice Calc for this as a workaround.

To import a UTF-8 CSV into Excel you just run the Text Import Wizard and
specify the encoding.

~~~
cel1ne
Yes.

My relief in this area was discovering that most (western) non-software
companies dealing with Excel-CSV know that encodings are a problem and many
have settled on using CP-1252 instead of UTF-8.
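
For anyone who ends up on the CP-1252 route, here is a minimal sketch of what that looks like in Python. The function name `to_cp1252_csv` is made up for illustration; the point is only that the encode step fails loudly when a value cannot be represented in Windows-1252, instead of silently writing garbage.

```python
import csv
import io


def to_cp1252_csv(rows):
    """Serialize rows to CSV bytes in Windows-1252 (hypothetical helper)."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)  # default line terminator is \r\n
    # encode() raises UnicodeEncodeError for text outside CP-1252 (e.g. CJK)
    return buf.getvalue().encode("cp1252")


data = to_cp1252_csv([["name", "city"], ["Zoë", "Köln"]])
```

Western European text survives this fine; anything outside the code page raises, which is arguably the behaviour you want before handing the file to someone else.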

------
alexott
Since that time, Microsoft has opened up the documentation for all the Office
file formats, and it's really good - very detailed. The documentation support
team is very responsive when something isn't clear. Plus they provide quite a
few additional tools for working with the binary files, and for validating
them if you're implementing generation of these files.

But the formats themselves are sometimes quite illogical, especially when you
embed one in another. And the binary formats are very different for each
Office component. For example, PowerPoint keeps most of its information in one
big blob inside OLE, while Excel and Word tend to store smaller objects
separately.

P.S. I've implemented data extraction for all MS office files for commercial
products, and also participated in development of catdoc program about 10
years ago

~~~
youdontknowtho
Wow. That's a positive and well informed opinion. Like a white-rhino-unicorn
on a HN Microsoft thread...

------
StillBored
Mentioned only briefly, but the document format is potentially infinite
because anyone can write an OLE/COM object that can be embedded in word
(frequently done without even realizing it via a copy/paste job). The
resulting object then gets serialized into the save file, which means that
unless you happen to have an environment that can restart the COM object you
cannot actually create something that can guarantee 100% compatibility unless
you also implement most of the windows API.

And you say, so what, edge case, but I've yet to find anything that can import
Office documents with embedded Visio, which seems to be all over creation in
IT/etc documents, even though Visio itself now saves in an OPC-type format.
This is one of the huge Mac<->PC Office issues and spawned a bunch of
third-party Visio clones (which bring their own set of problems).

------
taspeotis
> Let Office do the heavy work for you. Word and Excel have extremely complete
> object models, available via COM Automation ... You have a web-based
> application that’s needs to output existing Word files in PDF format. Here’s
> how I would implement that: a few lines of Word VBA code loads a file and
> saves it as a PDF using the built in PDF exporter in Word 2007. You can call
> this code directly, even from ASP or ASP.NET code running under IIS. It’ll
> work.

It'll work until it doesn't [1]. Like if you want to do two things at the same
time.

> Considerations for server-side Automation of Office

> ...

> Reentrancy and scalability: Server-side components need to be highly
> reentrant, multi-threaded COM components that have minimum overhead and high
> throughput for multiple clients. Office applications are in almost all
> respects the exact opposite. Office applications are non-reentrant, STA-
> based Automation servers that are designed to provide diverse but resource-
> intensive functionality for a single client. The applications offer little
> scalability as a server-side solution ... Developers who plan to run more
> than one instance of any Office application at the same time need to
> consider "pooling" or serializing access to the Office application to avoid
> potential deadlocks or data corruption.

And if you want to use this functionality to service the requests of anonymous
users, make sure to read up until "[u]sing server-side Automation to provide
Office functionality to unlicensed workstations is not covered by the End User
License Agreement (EULA)."

[1] [https://support.microsoft.com/en-
au/kb/257757](https://support.microsoft.com/en-au/kb/257757)

~~~
jrcii
For over 10 years I've written and maintained Excel automation that integrates
with a web-based research database. Beyond not being supported by the EULA,
Microsoft flat out says in its documentation _do not do this_: COM is not
designed for this purpose.

~~~
teh_klev
It's not COM that's the problem. When you instantiate Excel you're firing up
an out of process instance of Excel which you communicate with over DCOM.

The problem for web apps (for example written in Classic ASP or ASP.NET) is
that each request can create its own instance of Excel, and if the request
that did this didn't complete and cleanly close down Excel then you end up
with loads of orphaned Excel instances lurking in the background. When this
happens eventually the web server runs out of memory.

Excel is definitely not intended for this type of use. Some people get lucky
because they know what they're doing, 99% of the rest should just stay away.

~~~
mistermann
Can't you just write it as a Windows service then and use some sort of an
asynchronous queue (like a file folder)?
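
A minimal sketch of that serialization idea, assuming a single worker that drains a queue so that only one conversion ever runs at a time. Everything here is hypothetical: `run_serialized` is an illustrative name, and `fake_convert` merely stands in for whatever would actually drive Word or Excel.

```python
import queue
import threading


def run_serialized(jobs, convert):
    """Run convert(job) for every job on ONE worker thread, in order."""
    q = queue.Queue()
    results = {}

    def worker():
        while True:
            job = q.get()
            if job is None:  # sentinel: no more work
                break
            try:
                results[job] = convert(job)
            except Exception as exc:  # one bad document must not kill the worker
                results[job] = exc

    t = threading.Thread(target=worker)
    t.start()
    for job in jobs:
        q.put(job)
    q.put(None)
    t.join()
    return results


def fake_convert(path):
    # Hypothetical placeholder; a real service would call the Office app here.
    if not path.endswith(".docx"):
        raise ValueError("unsupported: " + path)
    return path[:-5] + ".pdf"


out = run_serialized(["a.docx", "b.txt"], fake_convert)
```

This only addresses the "two things at the same time" problem. It does nothing about orphaned processes or licensing, which are separate issues raised above.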

~~~
rbag
Worked with this kind of tech for a few years... and then trashed it to go
with Aspose components.

It's slow, and really hard to debug when it's not working (not only pdf output
but merge fields).

~~~
teh_klev
Aspose is a complete no brainer, yet so many clients complain about the extra
cost. Somehow they don't see the hours of their own time debugging this stuff
as invariably costing far more.

------
pipio21
As someone with lots of experience reading Microsoft documents I disagree, in
my opinion Microsoft was the worst company designing formats because:

1-Their programmers were terrible designers. Companies like Apple design
first, program later. With Microsoft it was the opposite. I don't care how
good a mechanic (programmer) you are if you are a bad engineer (designer) and
can't see the forest for the trees.

2-They were experts breaking formats ON PURPOSE; it is proven that Microsoft
actually introduced bugs to break compatibility in things like DOS (to combat
competitors like DR DOS) or the AVI format so people were forced to use their
products.

3-There were too much programmers (most of them not so good). While Netscape
put 20-40 people to work, Microsoft employed 1,000 to destroy the competition
with Explorer. When the mission was accomplished and the competition was
destroyed (and nobody dared to enter the given marketplace anymore), all these
people moved to other projects like Office.

4-Perversive incentives. The social promotion under Ballmer incentived people
to create lots of bad code fast instead of little good code.

~~~
WayneBro
> 1-Their programmers were terrible designers...Apple.

The whole "Apple Good. Microsoft Bad." is really tedious and it couldn't be
further from the truth. Apple is terrible at designing software.

Apple couldn't even attempt to handle the size of software projects that
Microsoft deals with. They couldn't even build their own OS, they had to buy
it from NeXT who stole BSD and Tivo-ized it. Furthermore, if Apple built
Office or designed the Office document formats, nobody would get any of the
features that they wanted.

That's because Apple designs the simplest thing that works for the majority of
people and anybody who doesn't fit that mold is screwed. It takes a company
with real grit, like Microsoft, to get this size job done.

> 2-They were experts breaking formats ON PURPOSE...

Oh, give me a fucking break. How about a citation?

> 3-There were too much programmers...

How much is "too much"? Is 20-40 the proper size then? Please tell us, in your
infinite expertise...what is the exact number of programmers that is a good
size for building Internet Explorer 1.0 through 6.0?

> 4-Perversive incentives. ...Ballmer incentived people to create lots of bad
> code...

Could you be a little less vague? How about an example?

~~~
hrktb
For someone who doesn't want to get into "Apple Good. Microsoft Bad." level of
rhetoric you seem to have a lot of grudge against Apple. To answer your
points, they built their own OS and they had their own Office suite.

They threw lots of stuff away afterwards, but from my point of view that's the
very reason why I use an Apple laptop right now. Had they stuck with the OS9
base, I'd never buy their products, the same way I don't consider buying a
Windows machine.

For your point 2) do you really need a citation about Office versions backward
non compatibility?

I wouldn't care to qualify whether those were done on purpose or not; the fact
is there were still product managers who green-lit the release of products
that would silently break compatibility with older versions in small but
critically annoying ways (screwing up the layout in a Word doc is the last
step before plain data loss).

Point 4) about incentives, I always wonder why we ended up with Office
versions installing libraries that would affect the behavior of Internet
Explorer. For instance, as an intern I built a dynamic form for an intranet
dedicated to IE (6 I think?), but it wouldn't work properly if a recent
version of Office was not installed on the system. Of course I was writing
shitty code, but it still boggles my mind that Office would install
system-wide libraries affecting the behavior of so many applications. There
was no incentive to prevent this kind of problem, or they were just really bad
designers, pick your option.

~~~
izacus
Probably the same reason you need to do a full iOS upgrade just to update
Apple Music app.

------
anovikov
I was running a team of developers who reverse-engineered Office and other
popular file formats to recover data from damaged files (that was in the 3.5''
diskette era and hard drives were prone to bad sector failures back then too,
so there was a sizeable market for that). And yes, file formats were really
complicated, which helped our business: there was a really high entry barrier.
I remember MS Exchange Server took two man-years to crack.

~~~
thr0waway1239
Well, you might be very qualified on this thread to let us know then - was the
complexity of the office file formats proportional to the feature set,
especially in comparison with the other file formats you reverse engineered?

~~~
mikekchar
I am not the original poster, but I also worked on office file formats --
specifically I was one of the poor saps who worked on file import and export
for Word Perfect after it was acquired by Corel. Before you send me hate mail,
in my defence the code was mostly written before I got to it, and I was merely
fixing the innumerable bugs in it.

I'm mostly familiar with the Word file format, so I will restrict my comments
to that. It's been more than 15 years since I did this stuff, so my memory is
hazy -- specifically I can't remember how the Excel file formats work at all.

Basically, the Word file format is a binary dump of memory. I kid you not.
They just took whatever was in memory and wrote it out to disk. We can try to
reason why (maybe it was faster, maybe it made the code smaller), but I think
the overriding reason is that the original developers didn't know any better.

Later as they tried to add features they had to try to make it backward
compatible. This is where a lot of the complexity lies. There are lots of
crazy work-arounds for things that would be simple if you allowed yourself to
redesign the file format. It's pretty clear that this was mandated by
management, because no software developer would put themselves through that
hell for no reason.

Later they added a fast-save feature (I forget what it is actually called).
This appends changes to the file without changing the original file. The way
they implemented this was really ingenious, but complicates the file structure
a lot.

One thing I feel I must point out (I remember posting a huge thing on slashdot
when this article was originally posted) is that 2 way file conversion is next
to impossible for word processors. That's because the file formats do _not_
contain enough information to format the document. The most obvious place to
see this is pagination. The file format does not say where to paginate a text
flow (unless it is explicitly entered by the user). It relies on the formatter
to do it. Each word processor formats text completely differently. Word, for
example famously paginates footnotes incorrectly. They can't change it,
though, because it will break backwards compatibility. This is one of the only
reasons that Word Perfect survives today -- it is the only word processor that
paginates legal documents the way the US Department of Justice requires.

Just considering the pagination issue, you can see what the problem is. When
reading a Word document, you have to paginate it like Word -- only the file
format doesn't tell you what that is. Then if someone modifies the document
and you need to resave it, you need to somehow mark that it should be
paginated like Word (even though it might now have features that are not in
Word). If it was only pagination, you might be able to do it, but practically
everything is like that.

I recommend reading (a bit of) the XML Word file format for those who are
interested. You will see large numbers of flags for things like "Format like
Word 95". The format doesn't say what that is -- because it's pretty obvious
that the authors of the file format _don't know_. It's lost in a hopeless
mess of legacy code and nobody can figure out what it does now.
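
For the curious, a rough sketch of how one might list those flags. It assumes the settings live in a `<w:compat>` element inside `word/settings.xml` of the .docx zip package, and the two flag names used in the demo are from memory, so check them against the spec. The package built here is synthetic and is not a valid document; `compat_flags` is just an illustrative name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def compat_flags(docx_bytes):
    """Return local names of the children of <w:compat> in word/settings.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/settings.xml"))
    compat = root.find(f"{{{W}}}compat")
    if compat is None:
        return []
    return [child.tag.split("}", 1)[1] for child in compat]


# Synthetic, minimal package for demonstration only (not a valid .docx).
settings = (
    f'<w:settings xmlns:w="{W}"><w:compat>'
    "<w:footnoteLayoutLikeWW8/><w:autoSpaceLikeWord95/>"
    "</w:compat></w:settings>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/settings.xml", settings)

flags = compat_flags(buf.getvalue())
```

Note that the code can only tell you which flags are set. What a renderer is supposed to do about each one is exactly the part the format leaves out.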

For programmers who have worked on long lived legacy systems before, none of
this should be a surprise. People think Microsoft purposely obfuscated their
stuff, but when I worked at Corel, Microsoft used to call us up to tell us
when we had broken our Word export filter. At least by that point, having Word
as a standard file format was a plus for them. However, whenever we asked them
what we should do to fix the filter, they invariably didn't know -- we knew
more than they did.

~~~
josteink
> Word, for example famously paginates footnotes incorrectly. They can't
> change it, though, because it will break backwards compatibility. This is
> one of the only reasons that Word Perfect survives today -- it is the only
> word processor that paginates legal documents the way the US Department of
> Justice requires.

That's actually really interesting. Got any more details on that?

~~~
mikekchar
It's been ages and someone told me that they are allowing Word's pagination
now as long as you explicitly say that's what you are doing (or something like
that). But basically, pagination is incredibly important in legal documents.
References are to specific pages and because footnotes can often span multiple
pages, it is important that you render it correctly (or else the reference
will point to the wrong page).

There is a specification for how to paginate legal documents in the US, but I
don't remember where to get it. IIRC it is based on the Chicago Manual of
Style. The other place you can find the correct explanation of pagination is
in the TeX source code, because it does it correctly.

My memory of what Word does wrong is really fuzzy, but I think it has to do
with footnotes that are longer than one page.

Every version, the DOJ would order thousands of upgrades of Word Perfect at
full price. In exchange for that, we would pretty much fix any bug they
wanted. We even wrote a printer driver for them once. If you look at the year
end reports for Corel around the 2000 year mark, you will see the office suite
numbers broken out. These are mostly to legal document users.

~~~
chiph
My understanding as to why Word Perfect continues to have a major portion of
the legal market is the huge installed base of template documents.

Want your lawyer to prepare a trademark infringement letter? He's going to
charge you several hundred dollars for basically filling in the fields on a
template he created a dozen years ago. And they aren't going to do anything
that threatens that goldmine, like switch word processors.

~~~
_acme
WordPerfect is no longer used by any major law firms. In 10 years of practice
at one of the world's largest law firms, I've never seen a WordPerfect
document.

------
Klathmon
I really enjoy this article. It's nice to see one that's not just bashing the
format as bad or insulting the developers.

Did Microsoft solve any of these problems in their "newer" file formats (IIRC
it's something like .docx instead of .doc)? And are those as crazy after a few
years or have things gotten better since then?

~~~
niftich
Back in 2007 I was a proponent of the 'No-OOXML' movement [1], which outlined
several objections (linked in the left margin on their page), some technical,
some political, to the ISO standardization process of the new XML formats.

In my opinion, the new formats are little more than XMLified representations
of implementation details of how MS Office works; there is a distinct
proliferation of elements that aren't purely semantic, and an awkward-but-
inconsistent lack of separation of structure from presentation. But I'm not an
expert on the binary formats nor the inner workings of their applications.
Though my opinions remain unchanged after 9 years, in retrospect I do like
that the new formats are documented by a true standards body, and are
interoperable not just by reverse-engineering a closed format.

[1] [http://noooxml.wikidot.com/start](http://noooxml.wikidot.com/start)

~~~
frik
> Documented by true standard body

Is ECMA that "true"? I don't know. At least they document a JavaScript without
"JavaScript" in its name. MS Office (2010 at least) isn't compatible with the
standard text. Is there even an MS Office nowadays that does better? Also,
Office 2010, despite supporting it, is hostile to the OpenDocument format -
opening a document gives you warnings about legacy format and whatnot.

The OOXML standardization process was very controversial and left a bad taste
around MS and ECMA Switzerland:
[https://en.wikipedia.org/wiki/Standardization_of_Office_Open...](https://en.wikipedia.org/wiki/Standardization_of_Office_Open_XML)

~~~
davidgerard
ECMA 376 is explicitly intended as documentation of what Office 2007 does.
Michael Meeks of LO (then at Novell) worked on it and says it did this job:
[https://people.gnome.org/~michael/blog/2014-02-26-cabinet-of...](https://people.gnome.org/~michael/blog/2014-02-26-cabinet-office.html#tc45)

But in practice there's no such thing as "OOXML" - there's only what Office
2007, 2008, 2010, 2011, 2013 and 2016 happen to do. So if you want to work
with the stuff, you verify everything. The standard is, like so much Microsoft
documentation, best treated as based-on-a-true-story fiction with an
unreliable narrator.

------
glaberficken
Laziest way to output an excel file: Write an html table to a file and make
the extension ".xls"

This has the minor niggle that it will throw a warning to the user ("The file
format and extension of *.xls don't match...")

But it has the "advantage" over the CSV approach that you can include
formatting via HTML/CSS styles.
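
A minimal sketch of the trick, for illustration only. `html_table` is a made-up helper; the one thing worth getting right is escaping the cell text. Whether a given Excel version will actually open the result is a separate question.

```python
import html


def html_table(rows):
    """Render rows as a bare HTML table, escaping cell text."""
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


markup = html_table([["item", "qty"], ["A&B", 3]])

# Saving this text under an .xls name is the whole trick, e.g.:
# with open("report.xls", "w", encoding="utf-8") as f:
#     f.write(markup)
```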

~~~
space_ghost
That's broken now in Office 2013 and Office 2016. Several of my users -at the
same time- found that they could not open "excel" documents provided by
several different vendors, all of which were actually HTML tables in files
named ".xls". My guess is that Microsoft broke this with a software update,
but I've yet to research the problem.

~~~
glaberficken
This is caused by a security update, details and workarounds on the link below

[http://www.infoworld.com/article/3098898/microsoft-windows/e...](http://www.infoworld.com/article/3098898/microsoft-windows/excel-refusing-to-open-files-blame-the-kb-3115322-3115262-security-updates.html)

------
gavinpc
> These are binary formats, so loading a record is usually a matter of just
> copying (blitting) a range of bytes from disk to memory, where you end up
> with a C data structure you can use. There’s no lexing or parsing involved
> in loading a file. Lexing and parsing are orders of magnitude slower than
> blitting.

Yeah, well that was _eight years_ ago. There's no way that, immediately below
this item on Hacker News, there would be another item boasting that

> Cap’n Proto is INFINITY TIMES faster than Protocol Buffers. This benchmark
> is, of course, unfair. It is only measuring the time to encode and decode a
> message in memory. Cap’n Proto gets a perfect score because there is no
> encoding/decoding step. The Cap’n Proto encoding is appropriate both as a
> data interchange format and an in-memory representation, so once your
> structure is built, you can simply write the bytes straight out to disk!
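
To make "blitting" concrete, here is a small sketch using Python's `ctypes`. The `CellRecord` layout is hypothetical and is not a real Excel record; the bytes from `struct.pack` stand in for a range read from disk.

```python
import ctypes
import struct


class CellRecord(ctypes.LittleEndianStructure):
    # Hypothetical layout for illustration; not a real Excel record.
    _fields_ = [
        ("row", ctypes.c_uint32),
        ("col", ctypes.c_uint32),
        ("value", ctypes.c_double),
    ]


raw = struct.pack("<IId", 3, 7, 2.5)      # stand-in for bytes read from disk
cell = CellRecord.from_buffer_copy(raw)   # one copy, no lexing or parsing
```

After the copy, `cell.row`, `cell.col` and `cell.value` are usable directly, and `bytes(cell)` gives back the same bytes for writing. The catch is the one this thread keeps circling: the file format is now welded to the in-memory layout.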

~~~
venning
I'm sure that's why this article is listed here today.

This comment on Cap'n Proto came one hour before this article was re-
submitted:
[https://news.ycombinator.com/item?id=12471541](https://news.ycombinator.com/item?id=12471541)

------
makecheck
The problem is compounded by users not knowing exactly what it is that they’re
storing/transmitting/archiving, and not caring enough to push for something
leaner and more accessible.

It is also sad to see people deleting things to “make space” within some
quota, when a massive amount of storage is clearly caused by unnecessarily-
bloated files. How do you convince them to abandon the only editors that are
familiar to them? How do you make plain text the new default?

People seem to like cracking open Word, typing a couple paragraphs and
“sending that” to everyone. They don’t realize the kitchen sink comes with it.
In the old days, sending large file attachments to an entire organization
could be a disaster. If the server wasn’t very smart then it encoded and
COPIED some monstrous Word file to _everybody’s_ Inbox. And copied it again
for reply-all. Inboxes would run out of disk space because E-mail sizes were
insane! It was really frustrating to see that the amount of _useful content_
was so small compared to the footprint.

------
userbinator
Are they really "so complicated", or is it just a large amount of options,
many of which might actually be ignorable for the task you're doing, that
contribute to such an impression of complexity?

 _They were designed to be fast on very old computers. For the early versions
of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an
80386 at 20 MHz had to be able to run Excel comfortably._

To me, this suggests the opposite --- a complicated format would be difficult
to parse efficiently. I found this document:

[https://www.openoffice.org/sc/excelfileformat.pdf](https://www.openoffice.org/sc/excelfileformat.pdf)

...which shows that it's basically TLV, so for simple data extraction it
doesn't seem so difficult after all. For example, if you're just after cell
values, you don't have to care about fonts, printing, and other formatting
info.
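
As a rough sketch of what "basically TLV" means here: each record starts with a 2-byte identifier and a 2-byte payload length, both little-endian, followed by the payload. The record IDs and the stream below are made up for the demo, and `iter_records` is an illustrative name. A real .xls additionally wraps the workbook stream in an OLE compound file and splits long records across continuation records, neither of which this handles.

```python
import struct


def iter_records(stream):
    """Yield (record_id, payload) pairs from a BIFF-style byte string."""
    pos = 0
    while pos + 4 <= len(stream):
        rec_id, size = struct.unpack_from("<HH", stream, pos)
        pos += 4
        payload = stream[pos:pos + size]
        if len(payload) < size:
            raise ValueError("truncated record")
        yield rec_id, payload
        pos += size


# Synthetic stream: two records we care about and one we skip unread.
stream = (
    struct.pack("<HH", 0x0001, 2) + b"\x2a\x00"
    + struct.pack("<HH", 0x7FFF, 3) + b"abc"
    + struct.pack("<HH", 0x0001, 2) + b"\x07\x00"
)
wanted = [
    struct.unpack("<H", payload)[0]
    for rec_id, payload in iter_records(stream)
    if rec_id == 0x0001
]
```

Skipping unknown record types costs nothing because the length is always there, which is what makes selective extraction cheap.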

~~~
IANAD
> Are they really "so complicated", or is it just a large amount of options,
> many of which might actually be ignorable for the task you're doing, that
> contribute to such an impression of complexity?

Apache's library called it HSSF, Horrible Spreadsheet Format. You make the
call.

:)

~~~
brianwawok
I have literally used it for YEARS and never knew what HSSF meant.

[https://en.wikipedia.org/wiki/Apache_POI](https://en.wikipedia.org/wiki/Apache_POI)
was a fun read

HPSF (Horrible Property Set Format), HSMF (Horrible Stupid Mail Format), DDF
(Dreadful Drawing Format)

~~~
userbinator
Amusing, but I also have the feeling that it's difficult for Java developers
to appreciate the benefits of a binary format because the language doesn't
make it easy to work with them. XML, on the other hand...

------
adsims2001
"A lot of the complexities in these file formats reflect features that are
old, complicated, unloved, and rarely used. They’re still in the file format
for backwards compatibility, and because it doesn’t cost anything for
Microsoft to leave the code around."

That's not really true. It is not free to leave old code around.

~~~
jlarocco
It's not free, but taking the code out isn't free either, and keeping it
around makes it easier to sell upgrades to newer versions.

------
dredmorbius
How do MS Word and Excel file formats compare to the file formats of
WordPerfect, AmiPro, WordStar, Lotus 123, Borland Quattro, etc?

~~~
yuhong
They themselves suggest writing Lotus 1-2-3 files at the bottom of the page.
Unfortunately Excel 2007 and later no longer supports it.

~~~
dredmorbius
What I'm edging at is that for all of Joel's apologia for Microsoft Office
formats, other comparable tools of the era were vastly more sensible.

Though I didn't find a doc for WordPerfect file format spec offhand.

~~~
eropple
"Sensible" cuts across a lot of axes. Those comparable tools, for the most
part, didn't have the same sorts of OLE wizardry that are _really pretty
useful_ for normal, non-technical people. That comes at a cost of technical
complexity.

~~~
dredmorbius
By "sensible": it was possible to implement independent, reliable, read-and-
write capable tools for these formats.

OLE was a tremendous boon to lock-in on the part of Microsoft. It was useful,
yes, but hardly flexible. There are other tools which offer comparable
capabilities without the lock-in elements.

Hal Varian (currently Google's chief economist) wrote the book on vendor lock-
in, and how to both secure and avoid it (as a customer) in the late 1990s.

~~~
eropple
There weren't comparable tools in 1990. There still aren't, for a lot of
things--something as simple as embedding a spreadsheet table in a document is
_still_ not really feasible except through similarly closed mechanisms.

Your attempt to define flexibility solely as "the use of non-Microsoft tools"
while casting aside exactly what OLE _does_ for non-technical users is pretty
transparent.

~~~
dredmorbius
There's little notional difference between embedding one software tool within
another, and calling one from another.

There's a considerably simpler architectural structure for the latter.

You still need the full multi-application support available. We're doing that
today with browsers (the universal document reader) and plug-ins. Which are
generally being considered a Bad Idea, and functionality (e.g., PDF readers,
video) now being natively supplied.

I'm defining flexibility as a lack of arbitrarily-imposed constraints. Which
is what the text on lock-in I referred to discusses at length.

I'm well aware that simple and expedient solutions often end up being long-
term untenable. This doesn't mean that they're not simple and expedient in the
first place. Though that simplicity often comes from the capacity to impose a
single standard across an internally consistent (at least on a point-in-time
basis) architecture.

Information technology vendors have long exploited the matter of standards to
self-serving benefit. Microsoft were not consistent in either supporting or
opposing standards. They _were_ consistent in applying standards policies to
their own benefit. Promotion of the IBM-PC compatibility standard increased
the platform for Microsoft OS and applications sales. Hindering standards such
as Ethernet, Internet, HTML, office application formats, Silverlight, OLE, AD,
Exchange connectors (POP, IMAP), etc., was also strategically pursued.

You're focusing on the software specifics rather than the strategy. Yes, the
trees are lovely, but there's a forest you might care to observe.

------
bitwize
Because when you dump memory contents to disk, unswizzle the pointers, and
call that your file format, things stop being simple.

~~~
chris_wot
It could have been worse, it might have been mork.

------
artursapek
Reading this makes all of the personal struggles I have had writing software
seem so petty.

------
plorg
It was difficult getting past the words Windows Meta File, the cause of one of
my many software headaches in graduate school. I never found a way to export
vector graphics from Matlab (admittedly not a Microsoft product) in a way that
they could be embedded in LibreOffice (or any other open Office clone) to
produce PowerPoint-compatible documents. But I certainly tried.

If the requirements were just to produce a document I could have generated
PDFs, but my department head wanted PPT slides. Apparently there is no other
vector graphics option that is compatible with both Matlab and Office, and
there are zero useful tools for editing WMF files, save for the pathetic
options available as part of the Office suite.

------
sidlls
It might take thousands of work years to recreate Office from scratch and with
an identical timeline but I categorically reject the notion that a clone from
scratch without that historical constraint would take that long.

~~~
Klathmon
Well, it would take less time to make a clone, but I'd argue it might take
longer to fully support the format in a program that was not designed for it
and that tries to use the format in ways it was not designed for.

------
mkhpalm
You unfortunately cannot "just use csv" if you're dealing with modern tabular
data. (non-ascii)

Here is an example of basic BOM ignoring and delimiter insanity demonstrated
by SAP:
[https://wiki.scn.sap.com/wiki/display/ABAP/CSV+tests+of+enco...](https://wiki.scn.sap.com/wiki/display/ABAP/CSV+tests+of+encoding+and+column+separator)

Long story short, Excel is to tabular data as IE was to HTML/CSS/Javascript.
It's headed for a cliff once people start realizing how bad it is at doing
basic tasks.

~~~
ygra
If you're targeting Excel and generate CSV, then use UTF-8 with signature and
include the sep= line at the start. Joel mentioned it as an alternative to
creating Excel files, i.e., creating a file specifically for Excel to read. If
that's the goal you don't need to care for other CSV readers and simply make
things nice in Excel.

SAP probably cannot do that.
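
A minimal sketch of the Excel-targeted CSV described above, assuming Python's standard `csv` module. The function name `excel_csv_bytes` and the default `;` delimiter are illustrative choices, and how Excel treats the combination of a signature and a `sep=` line may vary between versions, so this should be checked against the Excel you actually target.

```python
import csv
import io

def excel_csv_bytes(rows, delimiter=";"):
    """Return CSV bytes aimed at Excel: UTF-8 signature, then a sep= hint line."""
    buf = io.StringIO(newline="")
    # The sep= line tells Excel which delimiter to use regardless of locale.
    buf.write(f"sep={delimiter}\r\n")
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\r\n")
    writer.writerows(rows)
    # The "utf-8-sig" codec prepends the UTF-8 signature (BOM) on encode.
    return buf.getvalue().encode("utf-8-sig")
```

A file written this way is meant for Excel only; other CSV readers will likely treat the `sep=` line as a data row.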

~~~
kuschku
SAP can do that, but the idea is to have standard support.

And that means that any program reading or writing CSV should support all
possible standard-compliant CSV files.

You might get a CSV file from someone who created it with German Excel and
want to read it into French Excel (it won’t work).

------
youdontknowtho
With all of these comments that know so much better about everything related
to software than Joel or Microsoft, no wonder so many of you have written
successful office suite software that has a billion customers.

Oh wait...

~~~
thr0waway1239
Most people who read mikekchar's comments would now have a second viewpoint of
the story which differs from Joel's version.

Also please do tell us exactly what you have created so we can also gauge
precisely what you are qualified to talk about in the future using your own
metric.

And finally, since your comments on this story also obviously add a lot of
substantive value to the discussion, we are now even more fascinated to have
heard your very compelling argument in favor of Joel's version.

Oh wait...

~~~
youdontknowtho
Thanks Mr thr0waway1239. I would love to answer your questions!

I've never built anything even remotely approaching the scale of MS
Office...that being said, I try not to take the experiences I do have and use
them to belittle the work of people in vastly different scenarios. To read the
comments in this thread you would think the binary formats of the Office
Suite were proof positive that the people at MS were completely incompetent
AND Machiavellian enough to make software that worked, but obfuscated its
internals to throw off the most wily reverse engineers. It reminds me of the
way that Republicans say that Obama is the most incompetent bumbler and the
most tyrannical schemer playing 3 dimensional chess while continually tripping
over his own shoelaces.

That was my point. Statistically, most people never do anything great, yet
there are so many opinions about how to do great things...we know that most of
them must be wrong. Just not OUR opinion, right? We KNOW the right way to do
things...not like those assholes...they...blah blah blah.

I thought that Joel's article had some good points. It's good to hear some
inside knowledge on things sometimes. His point about using libraries to
extract data was pretty much the best idea going for a long time when dealing
with those documents. I would say that if you have to deal with that stuff
these days that the Microsoft Graph is the easiest way to do it. It's a REST
api that you can use to create and modify different Office objects. That's
just me thinking "how do I do this in the least amount of effort", though. I'm
not saying that anybody else should be doing it.

The thing is that most of the anti-Microsoft sentiment here is just knee-jerk
red team/blue team BS. It's more tribal than it is anything else. I just wish
that some of my programming brethren were a little more willing to question
some of their own assumptions.

So thanks for genuinely engaging me in an open dialog about honest
conversation...

oh wait...

------
sterex
What is said here held true in 1995. Post-2000, interoperability was pretty
evident. Microsoft is simply scared that if they made a simple enough format
they would lose the hold they have on the Office software.

Computers are now fast and robust enough to handle I/O much better. If they
were even a little ethical, they would have created a simpler format and built
on top of it instead of changing the entire spec every iteration of their
software.

Software does not evolve by retaining irrelevant code.

------
hitr
The latest versions of Office support Open XML formats (docx, xlsx, etc.), which
are basically zipped XML files. It's not mentioned in the article, but using the
Open XML SDK is the supported way of working with Office docs:
[https://msdn.microsoft.com/en-us/library/office/bb448854.asp...](https://msdn.microsoft.com/en-us/library/office/bb448854.aspx)
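
To show how approachable the container is, here is a rough sketch that uses only the Python standard library rather than the Open XML SDK. It assumes the main document part lives at the conventional path `word/document.xml` (strictly, the package relationships decide that), and the function name `docx_paragraphs` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(source):
    """Return the plain text of each paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

This ignores styles, tables, tracked changes and everything else that makes real documents hard, so for anything beyond pulling out text a proper library is probably still the better choice.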

------
pritambarhate
A little bit off the topic: But does anyone know if Google releases the file
format specifications for their Google Apps Office products?

As far as I know, you can't even download a file in the original format. You
have to download the Word / OpenOffice version instead.

------
be5invis
It reminds me of the format of fonts. That is, your font is actually a program,
a program used to draw shapes. There are even subroutine calls in it, and they
are widely used to reduce file size.

~~~
vram22
PostScript.

------
kuharich
Previous discussion:
[http://news.ycombinator.com/item?id=118909](http://news.ycombinator.com/item?id=118909)

------
gerfficiency
Microsoft has a document (KB257757) strongly recommending you not try to use
Office server-side. They have some recommendations on what other things you
could do instead.

------
youdontknowtho
So let me get this straight...

Microsoft programmers were terrible programmers who were expert enough to
break code for other people while still having it work well for their
customers?

------
flamedoge
I don't understand why they have to be when w3c can get html right

~~~
andrewguenther
Since when did the w3c get html right?

