Why are the Microsoft Office file formats so complicated? (2008) (joelonsoftware.com)
393 points by diziet 231 days ago | 227 comments

I remember Joel himself mentioning in some article about how Excel suddenly became dominant over a competitor (Lotus spreadsheets I think) because you could not only open Lotus files in Excel, but you could also save the Excel back into Lotus format (two way compatibility). I am supposing this is because Lotus file format was not complete crap.

If that is true, then isn't it also very possible that MS had an incentive to keep building on top of its cruddy file format without rethinking the design at any point? Isn't it true that even today, there is no viable competitor to MS Office even on non-Windows OS because MS has made file format compatibility extremely difficult?

I am very glad that the latest and greatest Office automation features after (and including) Office 2007 were mostly stillborn, so we could stop worrying about this issue and use MS Office as a tool rather than as a "platform". It became too unwieldy even for hardcore MS supporters, and even though no good alternatives emerged, people generally decided that the programmability of MS Office had reached its limits, and moved on to other things like browser-based apps. [1] I remember creating an InfoPath "app" as a contractor around 2010 and wondering, "What is this crap? Have these people never seen browser-based apps?"

[1] This is probably a sweeping statement. I would love to hear from someone who is certain that Office automation was a superior technology for a particular use case.

> Isn't it true that even today, there is no viable competitor to MS Office even on non-Windows OS

I'm not sure what you mean by a viable competitor but LibreOffice works very well for my needs and it runs on Windows, OSX and Linux. It also does a sufficient job of reading and writing MS file formats. For that matter, Google Docs is a pretty good competitor as well.

I think a lot of people have come to accept MS dominance as a given and then put their heads in the sand. It's quite easy to get along just fine in business these days without Microsoft or Apple for that matter.

Try opening a large CSV file in Excel, GDocs and LibreOffice. GDocs dies around 50k rows, LO around a few hundred thousand, while Excel copes with a million easily.

And yes, this is normal use, I don't want to use a database for doing a job that's 2 mins in Excel. We use GDocs at the company, but it's a toy (a nice toy though).

Try creating a UTF8 CSV-file that will open correctly in Excel.

Hint: It's not possible. My favourite chart of Excel's insufficiencies:

    Encoding  BOM      Win                            Mac
    --------  ---      ----------------------------   ------------
    utf-8              scrambled                      scrambled
    utf-8     BOM      WORKS                          scrambled
    utf-16             file not recognized            file not recognized
    utf-16    BOM      file not recognized            Chinese gibberish
    utf-16LE           file not recognized            file not recognized
    utf-16LE  BOM      characters OK,                 same as Win
                       row data all in first field
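For reference, the one combination the chart marks as working (UTF-8 with a BOM, opened on Windows) can be produced directly from Python's stdlib `csv` module; the `utf-8-sig` codec writes the BOM for you. A minimal sketch, with a made-up file name and data:

```python
import csv

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) that Excel on Windows
# keys on to detect the encoding; without it, Excel assumes a legacy
# codepage and non-ASCII characters come out scrambled.
with open("report.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Søren", "Kraków"])
```

Per the chart, this only helps on Windows; Excel for Mac scrambles it regardless.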

It is amazing how badly Excel handles CSV files. My favorite thing is where it will decide for some goddamn reason that a numeric or string column is actually supposed to be a date, or a currency, or some other random thing, and automatically format the column. Cue the calls from customers that don't know how to turn this "feature" off, when they export something as CSV and think it is corrupted.

This misfeature is also locale dependent. It's impossible to know how your customer's Excel will behave, unless you test every possible localization.

I found this unbelievable when I first came across it a few years ago.
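A common (if ugly) workaround for the column-coercion problem is to emit vulnerable fields as string formulas, so Excel has nothing to coerce. A sketch in Python; the field names and values here are invented:

```python
import csv

def excel_text(value):
    # Wrap a value so Excel keeps it as literal text: ="00123" is read
    # as a formula that returns the string 00123, whereas a bare 00123
    # would be coerced to the number 123, and a value like 1-2 to a date.
    return '="{}"'.format(str(value).replace('"', '""'))

with open("export.csv", "w", newline="", encoding="utf-8-sig") as f:
    w = csv.writer(f)
    w.writerow(["part_no", "date_code"])
    w.writerow([excel_text("00123"), excel_text("1-2")])
```

The trade-off: any consumer other than Excel now sees formula syntax instead of the raw value.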

More frustrating is when you need to produce a UTF-8 CSV from Excel. I always used OpenOffice Calc as a workaround for this.

To import a UTF-8 CSV into Excel you just run the text import wizard and specify the encoding.


My relief in this area was discovering that most (western) non-software companies dealing with Excel CSVs know that encodings are a problem, and many have settled on using CP-1252 instead of UTF-8.

Hi, you should check out my shell script gist that helps you with it. Remember to change the path to your LibreOffice executable https://gist.github.com/jhanschoo/bcd6d08d1e0f4b5d52735287f3...

CSV is a poorly defined format. It doesn't have a standard. No wonder it screws up when you start varying character encoding.

The first practical lesson of CSV: it WILL fail miserably when you change character encodings, use non-ASCII characters, have quoted strings, or vary end-of-line characters...

> CSV is a poorly defined format. It doesn't have a standard.

Wrong: CSV is standardized in RFC 4180:

> https://tools.ietf.org/html/rfc4180

Ex post facto. It had been a "standard" long before the RFC, so there are many CSV documents which need to be read but don't conform to the RFC spec. You should create a CSV writer that conforms, but your reader needs to support much more. And if you save back out in the RFC spec, the document might break downstream readers which were written long ago and don't follow the new RFC spec. It's an imperfect file format.
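That "conforming writer, liberal reader" split can be sketched with Python's stdlib `csv` module: write strict RFC 4180 (CRLF line endings, minimal double-quote escaping), but sniff whatever dialect the producer actually used on the way in:

```python
import csv
import io

def write_rfc4180(rows):
    # Strictly conforming output: CRLF line endings, double-quote
    # escaping, quoting only the fields that need it.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL,
               lineterminator="\r\n").writerows(rows)
    return buf.getvalue()

def read_lenient(text):
    # Liberal input: let Sniffer guess the delimiter and quoting
    # convention instead of assuming the RFC dialect.
    dialect = csv.Sniffer().sniff(text)
    return list(csv.reader(io.StringIO(text), dialect))

print(write_rfc4180([["a", 'say "hi"']]))   # a,"say ""hi"""
print(read_lenient("x;y\r\n1;2\r\n"))       # copes with semicolons
```

`Sniffer` is a heuristic, so a real-world lenient reader usually needs fallbacks beyond this sketch.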

Indeed. Differences in encoding, or the presence or lack of a BOM in UTF-8, are only a small subset of the issues. I've dealt with financial data from very large public banks and vendors that are horrible about vending proper CSV data: either improper or unescaped quotes (or delimiters). It takes weeks to get the data corrected, so you manually correct the data or write a custom parser to try to navigate around it. One of my favourites was a vendor that didn't escape the legal name of a company, so "Microsoft,Inc" would become two fields, rendering most of that record gibberish.

I generally agree with the idiom "If you write your own CSV parser, you're doing it wrong", but sometimes you have no choice but to work around the idiocy of others.

I don't know when it became "easy" to handle a million rows in excel, but I do know that some time ago that was intractable. As late as a few months ago, I recall trying to work with ~6 million rows in excel and it flaked out miserably.

For quick, ad-hoc tasks on huge files I recommend R/RStudio.

I also use Notepad++, of all things, for cleaning-up csv data because its "operating envelope" in terms of file-size and sheer number of files is unlike anything else. The regexp search/replace combined with zoomed-out views allow for easy clean-up of text data.

Excel 2007. Prior to that it was impossible due to file format limits; the row limit was 64k, I think. Not sure what the column limit was, or what the limits are (if any) since the move to the xlsx format.

Despite doing Excel add-in development in finance, I never hit a limit in post-2007 Excel.

No joking, I find Sublime Text brilliant to work with humongous CSV files :-)

Which version? The latest 3.x version, that I use, almost stalls trying to load a file of relatively small size of 20 odd MBs.

With 2.x I've opened and edited with multiple cursors (my favourite feature) files around a GB in size. On Windows, if that makes a difference.

Basically that's a lie. Yes, Office is good in many ways, especially ease of use. But opening a CSV file containing more than 100k rows is far smoother in newer LibreOffice versions than in Office 2013 and later. If you had written anything about UX, usability or format interoperability, rendering their own format, etc., then yes, Office is a clear winner; but when it comes to big files and resource usage, the newer Office versions are just bad.

> Basically that's a lie.

Well, that is how it worked for me, I used Excel 2016, the most recent LO (5?), and Google Spreadsheets. Around two months ago. I wanted to create a chart (histogram) from data in a CSV, with around 900k lines. Calling this a 'lie' is just bad manners, sorry.

And for many of my use cases (although not all) LO/OO was way ahead of Word at times (formatting using styles comes to mind as an area where I preferred old OO for years, and I also think LO/OO is less prone to misreading basically anything as an American-formatted date).

Yeah, if you have a document written in LO/OO, that's true. However, loading some XLS files is still awkward, probably due to the spec; even Microsoft sometimes had bugs in their own program, so in some cases the spec didn't help (some table bugs, etc.). But I didn't want to say XYZ is better overall; I just wanted to point out that resource usage and loading bigger files is definitely not where Office shines, at least in the newer versions. And don't expect Office for Mac to be good; it sucks.

Back in 2005 or so I had access to both, but I would still import into Writer to clean up after three other editors who didn't get (or maybe care about) styles, then export to Word again because Writer was missing some page numbering feature or something.

It is a common case in helpdesks I've collaborated with to fix corrupted Excel files by opening them in LibreOffice and resaving them as Excel. Also, Excel's practice of munging CSVs sometimes requires an initial read in LibreOffice, then saving to Excel, when Excel simply can't handle the file itself.

LibreOffice is slow but it is often more correct.

Update: Over the years, it seems to me that the MS formats are designed to be as incompatible as possible with other formats, to lock folks into their format. It enables them to say that they are the most "compatible"... but it really means the most "able to read their own format", even though compatibility across the board is better with the alternatives.

Have you tried Gnumeric? It's supposed to be fast at handling large files.

It is! Just imported a 3 million line csv with zero problems.

Excel can (as of today) handle 1,048,576 rows, which is 2^20. If you really need to work on bigger data sets than that, I'd recommend Power Pivot (natively included in Excel 2016), which is supposed to handle far more than 2^20 rows.

I agree. GD/LO spreadsheets are an appalling database engine.

I don't want to tell you how to do your own job, but if it's normal to handle millions of lines in a spreadsheet, you might be doing it wrong. (2 mins in excel; 2 seconds in a db)

It's not clear what you're doing with your data, but sqlite may be a viable option as well.

Using a million row sample from here:


Taking a summary of the data:

  > cat csvtest.sql
  .mode csv
  .import majestic_million.csv rankings
  select TLD,count(TLD) as cnt from rankings group by TLD order by cnt desc limit 10;

  > time sqlite3 < csvtest.sql

  real	0m6.295s
  user	0m6.089s
  sys	0m0.156s

Funny, I've always considered Excel (and all the others) equally bad at scaling, because none of them can cope with CSV files over 1 million rows. 50k or 1 million isn't a big difference. Where's the vi of spreadsheets for handling those 300-million-row DB dumps with a load error on line 253,452,624?

CSV is just plain text, so vi can work... But as others have said, try it with Gnumeric.

Million row Excel doc that you only need to do 2 mins worth of stuff to? I'm curious what your use case is, can it not be completely automated?

Taking a log file, using grep and sed to extract some numbers, and then creating a graph that shows the distribution of those numbers (a histogram). It was one-off; I could have used command-line tools for all of it, but then I'd have spent an hour learning how gnuplot works, which I didn't have time for. I'm a developer, and that's exactly why my time is expensive: automating everything doesn't always make sense.
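For what it's worth, that grep/sed/histogram pipeline is only a few lines of stdlib Python. A sketch, assuming an invented log format with a `took <n>ms` field:

```python
import re
from collections import Counter

def histogram(lines, pattern=r"took (\d+)ms", bucket=100):
    # Pull one number out of each matching line and count them into
    # fixed-width buckets -- a crude text histogram of the distribution.
    counts = Counter()
    for line in lines:
        m = re.search(pattern, line)
        if m:
            counts[int(m.group(1)) // bucket * bucket] += 1
    return dict(sorted(counts.items()))

log = ["GET /a took 42ms", "GET /b took 130ms", "GET /c took 155ms"]
print(histogram(log))  # {0: 1, 100: 2}
```

Whether this beats two minutes in a spreadsheet depends entirely on how often you do it, which is the parent's point.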

Depends on how often you do the task (and similar ones), of course. If it's frequent, then not learning a faster technique is getting stuck in a local optimum.

For this type of task I usually spin up an ipython/jupyter notebook and hammer something out with Pandas very quickly. It's an incredibly productive toolkit for quickly performing this type of task.

The task at hand could just be to answer a quick one-off question about the data. Not worth bothering the developers for that.

Indeed Libreoffice cannot handle large CSVs which is a shame, because otherwise it's an excellent tool.

Gnumeric works better for large files. Also, Open-Source.

I've run into excel problems at 64k rows. It was a while ago, but this century at least.

I think 2007 was the first version to support a million rows.

For basic things yes. For more advanced features, I can't think of any competitor.

As a result, Office has pretty much stopped evolving over the last 10 years. The only thing they are working on is making it more cross-platform / web friendly. But it's pretty much a dead ecosystem.

> For basic things yes. For more advanced features, I can't think of any competitor.

For features above the basic level, Microsoft Office is far from adequate itself. For example, try to reformat the figure numbering scheme, or add an appendix to a Microsoft Word document. Very basic stuff, and yet Microsoft Word fails to offer a sane way to do it.

Microsoft's offerings are only perceived as being above adequate because people don't know better.

And what would be better?

On one front, LaTeX. On another, e.g. Framemaker.

But as an overall documentation tool MS Word is hard to beat because you can still do most things, and it is ubiquitous.

In my experience, the worst problems with Word come from company templates. Plain Word out of the box causes fewer surprises.

Most of my troubles with Word have to do with most of the properties of the items in the document being invisible. You are dealing with invisible page breaks, invisible margins, invisible objects, etc.

What would be really powerful is a dual markup/WYSIWYG UI, like WPF in Visual Studio. Editing XML attributes in markup is far more efficient than trying to guess what you clicked on and then going through menus and nested submenus. But typing raw XML is not fun; you want to be able to type in the document directly. And you don't want to expose the markup to unsophisticated users. The markup editor should be far more defensive than a text editor: it should not allow you to place a node where that type of node is illegal, and autocomplete should add the mandatory pieces when you type a new node (the closing tag, the mandatory attributes, etc.). Otherwise we will spend our time dealing with corrupted documents.

Same goes for PowerPoint, which in my opinion has pretty much converged with Word. Most decks created are not on-screen presentations but rather rich visual text documents: just text documents with a page orientation that fits modern screens.

> What would be really powerful is a dual markup/WYSIWYG UI, like WPF in Visual Studio

WordPerfect had this in the '80s. It's a conventional word processor, but you could press the "Reveal Codes" function key and get a split screen. The lower half would show all the markup. You could see and delete a tag explicitly, and its matching one would also go away. Though of course, the markup was not indented.

My mother (a secretary) used it all the time. But when she met other secretaries who came from an MS Word background, they hated the very existence of the feature. I think the last few editions of WP removed it (in order to cargo-cult Word?).

Then they went out of business.

Word has had that for a long time. In modern ribbon-interface Word versions, it's in the Home-Paragraph section, top right icon ("paragraph mark"). Additionally you can/should toggle field code display with Alt-F9.

This has been in Word since, I think, Word for Windows 2.0 (circa 1992, for Windows 2.11). It might have also been in the previous DOS-based Word versions that co-existed with DOS-based WP (but since I used both, I can't always tell which features were in which).

Do you mean those strange cryptic characters at the end of the paragraphs?

That is not a lot like reveal codes, or like a markup language. Though I agree they are handy.

Maybe not markup language, but the idea is to show the markup. To get a paragraph break, you still hit Return and don't try to type that end-of-paragraph marker.

The field codes are more like an actual language; you can insert a field and then type the content in it.

> Most of my troubles with word have to do with most of the properties of the items on the document being invisible. You are dealing with invisible page breaks, invisible margins, invisible objects, etc.

WordPerfect doesn't have this problem:

> https://en.wikipedia.org/w/index.php?title=WordPerfect&oldid...

"The Reveal Codes feature, imperfectly imitated much later in Microsoft Word, is a second editing screen that can be toggled open and closed at the bottom of the main editing screen. It was especially beloved of those who were faced with Microsoft Word, which had at the time no similar feature. The codes for formatting and locating text are displayed, interspersed with tags and the occasional objects, with the tags and objects represented by named tokens. The scheme makes it far easier to untangle problems than with styles-based word processors, and object tokens can be clicked with a pointing device to directly open the configuration editor for the particular object type, e.g. clicking on a style token brings up the style editor with the particular style type displayed."

This is what commercial editors for DocBook and DITA allow, yet you don't see many people keen on embracing them.

> In my experience, the worst problems with Word come out of company templates.

Oh, yes, brother. Especially as they tend to be created by the people with the least experience in using (even modestly) advanced features.

I work for a major international corporation whose official PowerPoint template has hardcoded numbers on each slide: no auto page numbering...

> because you can still do most things, and it is ubiquitous.

That's also true for LibreOffice or other FLOSS office projects.

In fact, a copy of LibreOffice doesn't cost anyone between €100 and €300, nor does it require any subscription.

Still, in the corporate world most people do not have LibreOffice, but they do have Microsoft Word.

And in very many cases they can't even install software themselves, as the computer is managed by the IT/IM organisation. The cost of the license is negligible in this context; the actual cost comes from managing something non-standard.

Yes, you or I may not like it, but it's a fact in many corporations, government organisations and even educational institutions.

OTOH, I used to work as a govt contractor where they routinely handed out 1000-page software specifications with screenshots and non-unified titles. There was no safe way to save them, let alone work on them concurrently, so let's just say Word documents are not the right tool at all for many routine tasks in large companies.

What advanced things can Office do that LibreOffice can't?

I don't know for sure, but I'm willing to bet that there are sophisticated mail merge features in Office that don't exist in LibreOffice. The other major killer feature that I'm aware of in Office is the track changes feature. There are work flows that absolutely depend on that for cooperative editing and sign-off.

You can do these things other ways, but once the workflow is entrenched it would be really hard to move people to something else. If you've never seen some of the insane mail merges that people do, you might think it would be trivial to implement, but alas it is not.

Having said those things, it's been quite a while since I touched a word processor (5 years or so). When I last used Libre Office there were some other obscure bugs/features that made me want to throw it out the window. The main one was the way it formats multi-lingual text. If you try to write text that switches between roman and asian fonts, good luck in keeping the baseline in place. I made an attempt to convince people to change the idiotic way it is done (by shrinking the font size!) but had no luck.

I'm not a fan of Office either for the record, but I can see why some people do not want to change.

The mail merge functionality is constantly being enhanced. In 5.1 mail merge embedding was released - see here: https://archive.fosdem.org/2016/schedule/event/libreoffice_m...

You can track changes in LO also - starting in 4.4 it has been progressively getting better - see here:


Curious about the font issue... I've never tried to use Asian and Roman glyphs on one line before. But multilayout is a bit of a mess and is constantly being worked on.

> The other major killer feature that I'm aware of in Office is the track changes feature.

This is by far the worst "feature"; it's a plague that management enforces on its employees, mostly from not knowing any better or being plain ignorant. As you said, this comes down to an unwillingness to change and learn something new. In most cases, using a plain text format like asciidoc/asciidoctor with some version control system would be an enormous improvement.

I am a government attorney by day and a hobby programmer at night. There is absolutely no way you are going to get any significant number of office workers to use asciidoc and a VCS.

It's hard even to get a lot of people trained in Word. The agency has training classes when a new version is installed, which is once every few years. People struggle with stuff like the ribbon. To use asciidoc and a VCS, people would have to start from scratch training-wise. Not gonna happen.

I routinely format sizable documents and have to collaborate on them. Word is an abomination and I hate it. From a technical standpoint other tools are far superior. But a bunch of office workers are not going to learn anything else. It would not be worth the effort. Since MS has no incentive to improve Word I grit my teeth and accept that I will have to deal with its garbage.

My personal take is that people have difficulty with change in computing because computers are so damn good at irrecoverably destroying data.

Track changes is not just for "version control", and they don't want plain ascii text anyway.

I like LibreOffice because it's FLOSS and supports Open Document (formats). I use Calc for several of my personal spreadsheets. However, I personally do not like a few things with LibreOffice, which I think help the notion among people that MS dominance is a given and that solutions like LibreOffice are not good enough:

1. It does not support MS Office keybindings and its menu structure by default (not referring to the ribbon interface here). That's not a great thing for an application that would like more people from the MS Office camp to use it. As someone who's used MS Excel for a long time and learned several keyboard shortcuts, I still take more time to accomplish even simple things in LibreOffice and am learning different keyboard shortcuts to use Calc. Ideally, it shouldn't be this way. I don't understand what LibreOffice gains by adopting different keyboard shortcuts.

2. The UI (icons, menu fonts, colors, etc.) looks primitive and needs a redesign to keep up with the times (I've mentioned this in some other comments on HN).

3. It lags behind in feature set and ease of use compared to MS Excel (as one example, try creating, editing and changing charts in both). This is not really as big a deal for most users compared to the two points above.

This may seem a minor gripe, but last time I used LibreOffice it didn't have a way to visually crop an image just like Word or Google Docs can, which is disappointing for the sort of work I do, especially as I'd like to only use Free Software to do it.

It means I need to spend more time cropping the image in an external program, and I can't adjust the dimensions later. It's a hassle.

> It also does a sufficient job of reading and writing MS file formats.

Except for powerpoint presentations.

It's not that the file format was any better, it was just much simpler because in the DOS days you couldn't arbitrarily embed content from one application into another. Lotus' file format was just as proprietary as Microsoft's was. However, Excel had very good file importers, but also things like keystroke migration wizards etc. Microsoft invested a lot of time and money into doing everything they could to lower the transition barriers into Excel specifically for Lotus users. (and Office apps in general. They had similar efforts for WordPerfect->MS Word migration and others)

It was a fairly one-way trip in that even in the first versions of Excel, getting data back out was never as easy as getting it in was (i.e. the exporters weren't as complete as the importers... probably by design) and throughout the 90's Microsoft definitely built on and expanded the one-way nature of migration to their proprietary file formats as a key to lock-in. It wasn't so much that it was all that difficult as it was undocumented and always changing (there were multiple backwards incompatible file formats in each of the main MS Office apps in the 1990's). So if you were a customer or competitor, it wasn't so much that you couldn't reverse engineer the file format, it's that they were relatively undocumented and constantly changing so by the time you had done it, they were on to the next revision. I think it was a couple of years after the DoJ case, and a result of the settlement terms, that things really started improving on this front... but by then Microsoft had long since won. By that time most of their competitors had already figured out that the only way they could hope to compete with Microsoft was to bypass them entirely... hence the focus on web apps.

You mention web apps so I'll close with this: unless you're taking great pains to only use the open and cross-platform standards for them, in the not too distant future you're likely to find yourself in the exact same spot with your web apps that people were with desktop apps not too long ago. Different platform, same game...

I wrote a bit about this in another place, but the key to understanding the situation is to understand that the file format does not actually tell you how to format the text. Especially in the early days when rendering a font would work completely differently in each application because you never knew what font metrics they were using.

It's easier to be second to market because you can copy the leader. In this case, if you write your formatter with the intention of importing documents from the leader, then you always have a target. If you then add features which the leader ignores, suddenly they have a much more difficult time to catch up -- their formatter is not purposely built to render your file format.

As I said in my other message, I've completely forgotten the Excel file format, but Word's format was originally just a binary dump of all their data structures. That's why it is insane. It's hard to read unless your code happens to be structured that way.
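A toy illustration of why "a binary dump of the data structures" is so hostile to other implementations: nothing in the bytes reveals the layout, sizes, endianness or padding, so a reader has to mirror the writer's structs exactly. The record layout below is invented, not Word's:

```python
import struct

# Made-up record: little-endian int32 id, 16-byte name, uint16 flags.
RECORD = struct.Struct("<i16sH")

def save(doc_id, name, flags):
    # "Saving" is just a verbatim dump of the in-memory record.
    return RECORD.pack(doc_id, name.encode("ascii"), flags)

def load(blob):
    # Reading requires knowing the exact same layout out of band.
    doc_id, raw_name, flags = RECORD.unpack(blob)
    return doc_id, raw_name.rstrip(b"\x00").decode("ascii"), flags

blob = save(7, "notes.doc", 3)
print(load(blob))  # (7, 'notes.doc', 3)
```

Change one field in the writer's struct and every independently written reader silently misparses the whole file, which is roughly what third-party importers were up against.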

Anyway, I'm +1000 with you on the importance of open and cross-platform standards. Just think about all those stupid rails apps -- your data model looks exactly like your sql tables. Then you dump those same stupid tables into json and splat them over the wire for someone to try to make sense of. Deja vu all over again ;-) I suppose I should be happy it's not XML...

Seems to pretty much sum up the MS MO ever since they got the DOS contract with IBM.

You could observe something similar with how they dislodged Novell from the office network, and how they responded to Netscape trying the same.

I worry that we are seeing similar tactics being deployed by other entities in an effort to grab control of the Linux ecosystem.

> I worry that we are seeing similar tactics being deployed by other entities in an effort to grab control of the Linux ecosystem.

Can you explain what you're talking about here?

Tying everything into a single core middleware that changes too often to have its interfaces reliably cloned.

Thus whoever controls the agenda for the middleware controls the ecosystem (or can effectively fork it).

Lotus also got bought out by IBM, and quickly moved their product lines to life-support mode.

So I guess we both agree that the resulting complexity of the file format itself had a lot to do with an intention of lock-in. Which directly contradicts Joel's first point, that the formats are not "deliberately obfuscated". :-)

Also, regarding web apps, I like to think of it as the building not of "perfect systems" but of "self-correcting systems". The web, with all its chaos, is still more self-correcting than MS Office has been; in fact, even from your comment it does seem that the intention was only lock-in and not improvement of the file formats (other than perhaps an improvement in MS's bottom line).

Oddly, even the benefit to MS's bottom line now seems short-sighted, e.g. the ongoing fiasco that is the Windows 10 auto-upgrade. [1] Someday I expect to read post-mortems on the failure of Windows 10's so-called perpetual upgrade policy (i.e. the notion that there will be no more version numbers after 10) which link the failure to their refusal to build self-correcting systems.

[1] Personally, I don't think we have seen the worst of it just yet.

I'm posting too much on this topic :-) Just a quick word from someone whose job it was to write filters for these file formats. They are not deliberately obfuscated. They are just the product of long lived legacy projects with short deadlines and poor process. It is absolutely obvious when you work with the formats that this is the case. Word in particular was clearly just a dump of the internal data structures in the app. It's not even that complicated -- the complicated bits come later and they are all for either backwards compatibility or for the fast save feature (or whatever it was called). The fast save, in particular, is complicated and is that way because it is append only. It is essentially a series of patches on the original data structures. Yes, it will make you want to rip your hair out, but it is very, very obvious why it's that way.
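The append-only fast-save idea can be modelled in a few lines: each save appends a patch, and any reader has to replay the whole history to reconstruct the document. A deliberately simplified model, not the actual Word mechanism:

```python
# Fast save: append a patch (offset, replacement text) instead of
# rewriting the file. Saving is cheap; reading means replaying the
# entire patch history in order.
def fast_save(history, offset, text):
    history.append((offset, text))

def load(history):
    doc = ""
    for offset, text in history:
        doc = doc[:offset] + text + doc[offset + len(text):]
    return doc

history = []
fast_save(history, 0, "hello world")   # initial full write
fast_save(history, 6, "Word!")         # later edit: a patch, not a rewrite
print(load(history))  # hello Word!
```

It also shows why fast-saved files leaked data: "deleted" text stays in the file, buried under later patches.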

In the majority of cases, LibreOffice is a viable free replacement. There are only a few missing features, which you probably won't need. The only time that mattered for me was trying to do a certain kind of regression curve that was only in Microsoft Excel, for a physics class.

I would really love to agree, because I hate MS so much, but sadly it's false: every time I use LibreOffice, it's because I need to open a doc or ppt sent to me, and the layout is so messed up it's barely readable.

Key features including a decent UI? I hate using LibreOffice for this reason alone.

If LibreOffice/Google Docs fits your needs, you are artificially limiting your needs, probably without even realizing it.

Unless what you do can be done with a calculator (barebones basic stuff).

This argument extends easily: you're artificially limiting your needs by using Office, when you could be using SciPy/Jupyter/IPython/Pandas.

> when you could be using SciPy/Jupyter/IPython/Pandas

Who says that the HN audience doesn't?

I'm sure there are plenty of people here who use them. I'm responding to the person who claims libreoffice et al are artificially limited. My implicit argument is that the vast majority of people who use spreadsheet programs use them for list-keeping and simple analytics/reporting. For more complex uses, Excel is beaten by dedicated analytics tools.

Spreadsheets serve different needs than the tools you mentioned.

What needs are those? Which ones can't be served by the tools I mentioned?

Planning. Oftentimes all you need is to calculate some columns and play with numbers and see how the calculations change. The same things that make Excel bad for robust statistical modeling, make it good for all sorts of planning activities.

What more features am I supposed to be using when I write essays and plot small line charts?

The discussions about the suitability of Google Docs and LibreOffice make me wonder - is there a benchmark website for file format compatibility verification? If not, it may be very useful to create one and just host it on, say, GitHub Pages.

We could upload complex, but reasonably common, documents in various file formats for the corresponding software, and create easy-to-read checklists of the different failure modes when trying to open a file in software other than the one in which it was authored.

The post you're thinking of:

Strategy Letter III: Let Me Go Back! http://www.joelonsoftware.com/articles/fog0000000052.html

Since that time, Microsoft has opened the documentation on all Office file formats, and it's really good - very detailed. The documentation support team is very responsive when something isn't clear. Plus, they provide quite a few additional tools for working with the binary files, and for validating them if you're implementing generation of these files.

But the formats themselves are sometimes quite illogical, especially when you embed one into another. And the binary formats are very different for each Office component. For example, PowerPoint contains most of its information in one big blob inside OLE, while Excel and Word tend to store smaller objects separately.

P.S. I've implemented data extraction for all MS Office file types for commercial products, and also participated in the development of the catdoc program about 10 years ago.

Wow. That's a positive and well informed opinion. Like a white-rhino-unicorn on a HN Microsoft thread...

That is why Joel wrote the blog post in the first place AFAIK.

Mentioned only briefly, but the document format is potentially infinite because anyone can write an OLE/COM object that can be embedded in Word (frequently done without even realizing it via a copy/paste job). The resulting object then gets serialized into the save file, which means that unless you happen to have an environment that can restart the COM object, you cannot actually create something that guarantees 100% compatibility unless you also implement most of the Windows API.

And you say, so what, edge case, but I've yet to find anything that can import Office documents with embedded Visio, which seems to be all over creation in IT/etc. documents, even though Visio itself now saves in an OPC-type format. This is one of the huge Mac<->PC Office issues and spawned a bunch of 3rd-party Visio clones (which is its own set of problems).

> Let Office do the heavy work for you. Word and Excel have extremely complete object models, available via COM Automation ... You have a web-based application that’s needs to output existing Word files in PDF format. Here’s how I would implement that: a few lines of Word VBA code loads a file and saves it as a PDF using the built in PDF exporter in Word 2007. You can call this code directly, even from ASP or ASP.NET code running under IIS. It’ll work.

It'll work until it doesn't [1]. Like if you want to do two things at the same time.

> Considerations for server-side Automation of Office

> ...

> Reentrancy and scalability: Server-side components need to be highly reentrant, multi-threaded COM components that have minimum overhead and high throughput for multiple clients. Office applications are in almost all respects the exact opposite. Office applications are non-reentrant, STA-based Automation servers that are designed to provide diverse but resource-intensive functionality for a single client. The applications offer little scalability as a server-side solution ... Developers who plan to run more than one instance of any Office application at the same time need to consider "pooling" or serializing access to the Office application to avoid potential deadlocks or data corruption.

And if you want to use this functionality to service the requests of anonymous users, make sure to read up until "[u]sing server-side Automation to provide Office functionality to unlicensed workstations is not covered by the End User License Agreement (EULA)."

[1] https://support.microsoft.com/en-au/kb/257757

I have seen a production system (which is still live) using much like Joel's suggestion 2 - a Windows 2003 server that uses Excel to produce reports (inside an otherwise Linux shop). It's fragile, no one wants to touch it, it's costly to migrate, but too widely used to deprecate. It dies about once a month and has to be restarted.

I guess it works (although the output isn't particularly good), but it's drained quite a bit of maintenance time and even more morale.

I designed a system like that for the place I currently work. I'm positive the person who follows me in my position will curse me as the Devil. However, before I set up that system, documents were created with weekly batch merges that took one person two days to complete. Now a batch can be run in about 2 hours. I'm a huge Linux nerd, but it was easier to get something working with the existing infrastructure using COM than design something entirely new with PDFs and typesetting software, or whatever. Working in Windows drains my morale, sure, but making everyone else at my company that much more effective is a huge morale boost for myself, and it boosts the morale of everyone else here too. It also allows us to serve our constituents better, which is rewarding as well.

That being said, I wish I never had to program a mail merge in the COM ever again.

I think this is where the fact that Joel has not worked for MS for a long time shows.

Then it has shown that for the last 8 years, considering it was written in 2008

For over 10 years I've written and maintained Excel automation that integrates with a web-based research database. Beyond not being supported by the EULA, Microsoft flat out says in its documentation: do not do this, COM is not designed for this purpose.

It's not COM that's the problem. When you instantiate Excel you're firing up an out of process instance of Excel which you communicate with over DCOM.

The problem for web apps (for example written in Classic ASP or ASP.NET) is that each request can create its own instance of Excel, and if the request that did this didn't complete and cleanly close down Excel then you end up with loads of orphaned Excel instances lurking in the background. When this happens eventually the web server runs out of memory.

Excel is definitely not intended for this type of use. Some people get lucky because they know what they're doing, 99% of the rest should just stay away.

Can't you just write it as a Windows service then and use some sort of an asynchronous queue (like a file folder)?

Worked with this kind of tech for a few years... and trashed it to go with Aspose components.

It's slow, and really hard to debug when it's not working (not only PDF output but also merge fields).

Aspose is a complete no brainer, yet so many clients complain about the extra cost. Somehow they don't see the hours of their own time debugging this stuff as invariably costing far more.

Close. You don't even need a Windows service. Just use a serialized queue using MSMQ. Write the queue listener in Javascript and register it with COM+. Excel automation on the server is easy if you do it right, and impossible if you do it wrong.

Most of the devs I've encountered over the years who want to use Excel server side have never heard of queuing.
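The serialization idea is simple enough to show in miniature. On Windows the queue would be MSMQ and the listener a COM+ component; here Python's queue.Queue and a single worker thread stand in, purely to illustrate why an STA automation server like Excel only ever sees one request at a time:

```python
# Sketch: many request threads submit jobs, but a single worker drains the
# queue, so the (single-threaded) automation server handles one job at a time.
import queue
import threading

class SerializedAutomation:
    def __init__(self, handler):
        self._q = queue.Queue()
        self._handler = handler  # e.g. code driving one Excel instance
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, job):
        """Callable from any request thread; returns (result_box, done_event)."""
        done = threading.Event()
        box = {}
        self._q.put((job, box, done))
        return box, done

    def _run(self):
        while True:
            job, box, done = self._q.get()
            box["result"] = self._handler(job)  # only one job runs at a time
            done.set()
```

The web request just enqueues and waits (or polls later), so no request ever spawns its own Excel instance.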

> Just use a serialized queue using MSMQ. Write the queue listener in Javascript and register it with COM+.

Sure, but now you have MSMQ and COM+ to deal with...and debugging/fixing when it breaks (under load), or isn't secured properly.

The smart folks shell out for a copy of Aspose Cells and get the job finished on time.

Subsequent to my MSMQ approach, I did a project with Aspose. It's a pretty good toolkit, and I would recommend it over the MSMQ/COM+ complexity - Unless you absolutely have to automate Office, which is sometimes true.

On more recent projects in last ~12 months, I have used OpenXML.

I use Java RMI to connect AIX to W2K3 running a service that uses Jacob COM to call Excel. It reads a CSV file off a Samba share on the AIX box and saves back an XLS.

The lock file is on the AIX side because Win files aren't atomic.

It's a bit Rube Goldberg-ish and has low traffic, but it's rock solid.

I have done similar crazy things with Lync, back before Microsoft opened up their APIs on the Office 365 side so that they were marginally usable.

I wouldn't recommend it...

Just out of interest, have you explored using the Apps for Office functionality? I think it could provide a more robust solution (robust in the sense that it's supported by Microsoft) for integrating web interfaces into Excel. Basically it allows you to build Excel extensions using HTML/CSS/JS. I could understand using COM before Apps for Office was available.


I nearly had to support one of these - we were using embedded OpenOffice 3.2 on Solaris then Ubuntu to translate DOC into PDF. Someone suggested we use embedded Office; we said "certainly! you will need to pay for a Windows admin." We had two customers for this failed project, so they didn't go for it ...

the tale: https://reddragdiva.dreamwidth.org/599841.html

As someone with lots of experience reading Microsoft documents, I disagree; in my opinion, Microsoft was the worst company at designing formats, because:

1-Their programmers were terrible designers. Companies like Apple design first, program later. With Microsoft it was the opposite. I don't care how good a mechanic (programmer) you are if you are a bad engineer (designer) and can't see the forest for the trees.

2-They were experts at breaking formats ON PURPOSE; it is proven that Microsoft actually introduced bugs to break compatibility in things like DOS (to combat competitors like DR DOS) or the AVI format, so people were forced to use their products.

3-There were too much programmers (most of them not so good). While Netscape put 20-40 people to work, Microsoft employed 1,000 to destroy the competition with Internet Explorer. When the mission was accomplished and the competition was destroyed (and nobody dared to enter the given marketplace anymore), all these people moved to other projects like Office.

4-Perverse incentives. Promotion under Ballmer incentivized people to create lots of bad code fast instead of a little good code.

> 1-Their programmers were terrible designers...Apple.

The whole "Apple Good. Microsoft Bad." is really tedious and it couldn't be further from the truth. Apple is terrible at designing software.

Apple couldn't even attempt to handle the size of software projects that Microsoft deals with. They couldn't even build their own OS, they had to buy it from NeXT who stole BSD and Tivo-ized it. Furthermore, if Apple built Office or designed the Office document formats, nobody would get any of the features that they wanted.

That's because Apple designs the simplest thing that works for the majority of people and anybody who doesn't fit that mold is screwed. It takes a company with real grit, like Microsoft, to get this size job done.

> 2-They were experts breaking formats ON PURPOSE...

Oh, give me a fucking break. How about a citation?

> 3-There were too much programmers...

How much is "too much"? Is 20-40 the proper size then? Please tell us, in your infinite expertise...what is the exact number of programmers that is a good size for building Internet Explorer 1.0 through 6.0?

> 4-Perverse incentives. ...Ballmer incentivized people to create lots of bad code...

Could you be a little less vague? How about an example?

For someone who doesn't want to get into "Apple Good. Microsoft Bad." levels of rhetoric, you seem to have a lot of grudge against Apple. To answer your points, they built their own OS and they had their own office suite.

They threw lots of stuff away afterwards, but from my point of view that's the very reason why I use an Apple laptop right now. Had they stuck with the OS 9 base, I'd never have bought their products, the same way I don't consider buying a Windows machine.

For your point 2), do you really need a citation about Office versions' backward incompatibility?

I wouldn't care to qualify whether those were done on purpose or not; the fact is there are still product managers who green-lit the release of products that would silently break compatibility with older versions in small but critically annoying ways (screwing up the layout in a Word doc is the last step before plain data loss).

Point 4) about incentives: I always wondered why we ended up with Office versions installing libraries that would affect the behavior of Internet Explorer. For instance, as an intern I built a dynamic form for an intranet dedicated to IE (6, I think?), but it wouldn't work properly if a recent version of Office was not installed on the system. Of course I was writing shitty code, but it still boggles my mind that Office would install system-wide libraries affecting the behavior of so many applications. There was no incentive to prevent this kind of problem, or they were just really bad designers; pick your option.

Probably the same reason you need to do a full iOS upgrade just to update Apple Music app.

re > 2-They were experts breaking formats ON PURPOSE...

there's quite a good discussion in Wikipedia

>"Embrace, extend, and extinguish", also known as "Embrace, extend, and exterminate", is a phrase that the U.S. Department of Justice found was used internally by Microsoft to describe its strategy for entering product categories involving widely used standards, extending those standards with proprietary capabilities, and then using those differences to disadvantage its competitors. https://en.wikipedia.org/wiki/Embrace,_extend_and_extinguish

Microsoft created the Office document standard. So "EEE" doesn't really apply here.

The person I responded to didn't list one example that was relevant to Office.

Adding proprietary modules to existing standards is VERY different from willingly breaking a closed format they created in the first place.

> Apple couldn't even attempt to handle the size of software projects that Microsoft deals with. They couldn't even build their own OS, they had to buy it from NeXT who stole BSD and Tivo-ized it. Furthermore, if Apple built Office or designed the Office document formats, nobody would get any of the features that they wanted.

Er, you mean like how Microsoft based DOS off QDOS and Windows NT off OS/2?

>> 2-They were experts breaking formats ON PURPOSE...

> Oh, give me a fucking break. How about a citation?

This was their standard practice at the time. Comes v Microsoft was all about this. http://groklaw.net/staticpages/index.php?page=20070217201900...

"stole BSD and Tivo-ized it"

Please leave your GPL zealotry at home. This is not what NeXT was.

Well it kind of is. A modified BSD kernel (open source, called darwin) that can't run without heavy modifications on non-apple hardware, with lots of proprietary libraries bolted in.

So they effectively took an open source software and made it hard to use outside their own hardware. So unless I'm missing something, macOS is basically a "tivoized" BSD.

I agree though that NeXT didn't "steal BSD". BSD is liberally licensed, so NeXT was perfectly within their rights to use it. Legally speaking, I don't think they even needed to open source their fork.

The XNU kernel is not a BSD kernel, it's a Mach kernel. There may be some BSD code in it (the unfinished net80211 wifi stack for iOS, ipfw/pf firewall, MAC framework, some things bolted on to make it traditional unix-y), but it's not a BSD kernel.

> 1-Their programmers were terrible designers.

Maybe this is true when it comes to the Office formats in particular, I'm not familiar with them at all. But in general this is absolutely not true, or at least there are a lot of exceptions. The work of Dave Cutler's team, in particular, is an example of exceptional design, and do bear in mind that this is praise from someone who's been using and writing code for *nix systems for 15 years.

And don't get me started on Apple's "design first, program later". Most of the places where this still shows are inherited from NeXT. Their only serious attempt at an office format (that I'm aware of), in Pages, was pretty much a disaster.

Why should Apple users have to consider saving documents as a package vs. a single file? Why is this even an option?

Well, the package format proved to be a compatibility nightmare for Apple.

"Design first" didn't really work out. The Apple Pages document format is a compatibility nightmare and Apple fixed it by putting the onus on the user to fix it with abstruse save options.



Could you at least point out the logical fallacy in his argument? Simply saying "you're wrong, but I'm not going to tell you why" is far from productive (and ironically borders on the fallacy fallacy).

I was running a team of developers who reverse-engineered Office and other popular file formats to recover data from damaged files (that was in the 3.5'' diskette era and hard drives were prone to bad sector failures back then too, so there was a sizeable market for that). And yes, file formats were really complicated, which helped our business: there was a really high entry barrier. I remember MS Exchange Server took two man-years to crack.

I once worked with this insane genius who reverse-engineered DOC and PST by bit-banging the files in Visual Basic. Once I mentioned the compound document format and he was puzzled for a second and then shrugged "I guess that's what that was". He clearly hadn't read any of what little documentation existed. He was the complete self-taught lone hacker type, his code had zero organization and was only understandable to him, but when it came down to it, it mostly worked and was a huge competitive advantage, for a while. I was hired to ship the web frontend, the backend document pipeline was solely his baby. Eventually MS shipped a patch to Office 2003 so that save as HTML mostly worked as expected and there was no longer an advantage in parsing the binary files.

Probably not with me. I never really worked with Visual Basic; that stuff was done in C++. And for Word, we used the available docs, and yes, we knew that it was a compound file as an outer shell (we wrote our own fault-tolerant parser for it, which worked around damaged pieces whenever possible; initial development was done using intact files and the Docfile API). The devil was deeper: the format of the document stream and formatting stream was quite complicated, and the documentation was very outdated and lacking in detail. Extracting just the text was easy - simply go through the plcfpcd, extract the positions of the pieces and their encoding (ASCII or Unicode), convert it all to Unicode, and print to standard output. Extracting formatting (we used RTF for output) was crazy.
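The text-extraction step described here has a simple shape, even if the real plcfpcd layout (a CP array plus PCD structs, with the encoding hidden in a bit of the file offset) is messier. A schematic sketch, with the piece list pre-decoded into plain tuples for illustration:

```python
# Schematic piece-table walk: read each text piece at its file offset in its
# own encoding and concatenate everything as Unicode. The tuple form of the
# piece list is an invented simplification of the real plcfpcd structures.

def extract_text(doc: bytes, pieces):
    """pieces: iterable of (offset, byte_length, is_unicode) tuples."""
    out = []
    for offset, length, is_unicode in pieces:
        raw = doc[offset:offset + length]
        # Word stores pieces either as UTF-16LE or as a single-byte code page.
        out.append(raw.decode("utf-16-le" if is_unicode else "cp1252"))
    return "".join(out)
```

As the commenter says, this part is the easy bit; mapping the formatting streams onto those pieces is where the pain lives.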

PST was indeed much, much harder to do, it was a crazy effort which took about a man-year and it was all done by hand.

And we never had web frontend for it, at least not on my watch (i left in 2007)

Well, you might be very qualified on this thread to let us know then - was the complexity of the office file formats proportional to the feature set, especially in comparison with the other file formats you reverse engineered?

I am not the original poster, but I also worked on office file formats -- specifically I was one of the poor saps who worked on file import and export for Word Perfect after it was acquired by Corel. Before you send me hate mail, in my defence the code was mostly written before I got to it, and I was merely fixing the innumerable bugs in it.

I'm mostly familiar with the Word file format, so I will restrict my comments to that. It's been more than 15 years since I did this stuff, so my memory is hazy -- specifically I can't remember how the Excel file formats work at all.

Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn't know any better.

Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy work-arounds for things that would be simple if you allowed yourself to redesign the file format. It's pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.

Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.

One thing I feel I must point out (I remember posting a huge thing on slashdot when this article was originally posted) is that 2 way file conversion is next to impossible for word processors. That's because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless it is explicitly entered by the user). It relies on the formatter to do it. Each word processor formats text completely differently. Word, for example, famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.

Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word -- only the file format doesn't tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.

I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like "Format like Word 95". The format doesn't say what that is -- because it's pretty obvious that the authors of the file format don't know. It's lost in a hopeless mess of legacy code and nobody can figure out what it does now.

For programmers who have worked on long lived legacy systems before, none of this should be a surprise. People think Microsoft purposely obfuscated their stuff, but when I worked at Corel, Microsoft used to call us up to tell us when we had broken our Word export filter. At least by that point, having Word as a standard file format was a plus for them. However, whenever we asked them what we should do to fix the filter, they invariably didn't know -- we knew more than they did.

> Word, for example famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.

That's actually really interesting. Got any more details on that?

It's been ages and someone told me that they are allowing Word's pagination now as long as you explicitly say that's what you are doing (or something like that). But basically, pagination is incredibly important in legal documents. References are to specific pages and because footnotes can often span multiple pages, it is important that you render it correctly (or else the reference will point to the wrong page).

There is a specification for how to paginate legal documents in the US, but I don't remember where to get it. IIRC it is based on the Chicago Handbook of Style. The other place you can find the correct explanation of pagination is in the TeX source code, because it does it correctly.

My memory of what Word does wrong is really fuzzy, but I think it has to do with footnotes that are longer than one page.

Every version, the DOJ would order thousands of upgrades of Word Perfect at full price. In exchange for that, we would pretty much fix any bug they wanted. We even wrote a printer driver for them once. If you look at the year end reports for Corel around the 2000 year mark, you will see the office suite numbers broken out. These are mostly to legal document users.

My understanding as to why Word Perfect continues to have a major portion of the legal market is the huge installed base of template documents.

Want your lawyer to prepare a trademark infringement letter? He's going to charge you several hundred dollars for basically filling in the fields on a template he created a dozen years ago. And they aren't going to do anything that threatens that goldmine, like switch word processors.

WordPerfect is no longer used by any major law firms. In 10 years of practice at one of the world's largest law firms, I've never seen a WordPerfect document.

Thanks, that was excellent reading.

Looking at your comment on how the file format is only declarative and sort of allows the implementer to decide what to do - this is similar to what HTML does. But HTML being an open spec, actually allowed implementers to deviate in some ways but there was still some minimum expected compatibilities, so its progress didn't suffer as badly.

But the Word file format was not even an open spec, so really, when people sometimes jump to Microsoft's defense, they are not even understanding how technology innovation actually suffers from closed standards which become dominant.

On the other hand, Microsoft itself could have benefited from having an open standard of the file format around when it wanted to write similar exporters. Why didn't it push for it, given the enormous power it had? Surely they put their business interests over what was the best thing to do in that situation from the engineering perspective?

I say all this because the more you hear from the insiders of that era, the more Joel's article is starting to look quite lopsided and biased, specifically this statement - "At no point in history did a programmer ever not do the right thing, but there you have it." It is nice to be able to fill out some of the missing details.


Edit: OK, are you also saying that once the initial decision was made to have a binary dump of all the data structures, there was no possibility of improving the format?

If so, does it mean that every application which had its own file format and took a similar approach as Microsoft's during that time (given that you say the original developers didn't know any better), was similarly stuck? How did Word Perfect design its file format?

Without throwing away backwards compatibility, it's pretty hard to improve the format. They could have easily done that, of course, but it was clearly not on the cards, politically. As a programmer, I think that's a bad decision, but it might have been the right decision for the user. It's hard to say.

Word Perfect actually had a vaguely non-sucky file format. When they "handed over"[1] development from the original WP team I had the opportunity to chat with them about it. They were very, very proud of having designed an actual file format rather than just dumping crap onto disk.

Essentially, it was a stream of tokens, each token representing a command. Again, it's been ages, so forgive me if I get this wrong, but you would have a "bold on" token, followed by some text, followed by a "bold off" token. For more complicated tokens I seem to remember they would embed the options in the token.

If you have ever used WP, you may have seen the "reveal codes" feature (which was WP's "killer feature"). This was pretty much the actual representation of the stream. It allowed the user to see exactly how the formatting codes were laid out so that you could fix problems (instead of trying to guess why you have some bizarre vertical space in your document as you often have to do with Word). The main problem with WP's file format was that it was very, very easy to generate illegal streams or to mismatch tokens (especially to get the options wrong between the start and close tokens). I always tell people never cut and paste in WP because it will corrupt the document eventually (a little more technical background: cut and paste was implemented by going through the RTF filter and back, which could easily corrupt the tokens on the way through).
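The token-stream idea is easy to model in miniature. Everything below is invented for illustration (WP's actual codes and options are not shown); the point is how a paired on/off design makes corruption both easy to produce and easy to detect:

```python
# Toy model of a WP-style token stream: paired on/off formatting codes
# surrounding text runs, as seen in "reveal codes". The checker uses a stack
# to detect the mismatched pairs that a lossy round-trip (e.g. through an RTF
# filter and back) can introduce.

def check_stream(tokens):
    """tokens: list of ('on', code) / ('off', code) / ('text', s) tuples.
    Returns True iff every 'off' matches the most recently opened 'on'."""
    stack = []
    for kind, value in tokens:
        if kind == "on":
            stack.append(value)
        elif kind == "off":
            if not stack or stack.pop() != value:
                return False  # orphaned or mismatched close token
    return not stack  # a still-open token also corrupts the document
```

A stream like bold-on / text / italic-off fails the check, which is exactly the kind of damage the cut-and-paste path could leave behind.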

[1] Cautionary tale: Word Perfect Corporation was a generous company. When Word Perfect became very, very popular, they rewarded the original programmers handsomely. However, instead of giving the programmers equity, they gave them gigantic raises. By the time Corel acquired WP, the original programmers were on salaries that would make even high flying valley programmers blush. After sitting out the time period that they were contractually obliged to, Corel replaced those people with cheaper talent (roughly 1/10th the cost!). That included me. I always felt horrible about the situation, but fervently hoped that the programmers managed to save some of their ridiculous previous salaries.

Thanks for all your insights :-) Even though you feel a little conflicted about it, I would like to think all of us here at HN are very glad to know about your story.

This was a better read than the article. Touching on the same points, but briefer and from the perspective of someone who actually worked on it. Thanks!

Miklos Vajna struck a problem when he implemented complex drawing shapes in LO 4.3 - in Office 2010 he found a document that displayed a green triangle, but in Office 2007 it displayed a red triangle. I believe this was under OOXML Strict...

Well, if it is a dump of memory, then what is the 'fast save' feature?

Microsoft largely documented the file format here:


I really enjoyed this article. It's nice to see one that's not just bashing the format as bad or insulting the developers.

Did Microsoft solve any of these problems in their "newer" file formats (IIRC it's something like .docx instead of .doc)? And are those as crazy after a few years, or have things gotten better since then?

Back in 2007 I was a proponent of the 'No-OOXML' movement [1], which outlined several objections (linked in the left margin on their page), some technical, some political, to the ISO standardization process of the new XML formats.

In my opinion, the new formats are little more than XMLified representations of implementation details of how MS Office works; there is a distinct proliferation of elements that aren't purely semantic, and an awkward-but-inconsistent lack of separation of structure from presentation. But I'm not an expert on the binary formats nor the inner workings of their applications. Though my opinions remain unchanged after 9 years, in retrospect I do like that the new formats are documented by a true standards body, and are interoperable not just by reverse-engineering a closed format.

[1] http://noooxml.wikidot.com/start

The best example of how this "just serialize our program's operations" style of file format turned out crazy was the WMF exploit. Merely hovering over a malicious WMF file in Explorer would trigger the payload. And the bug was literally a GDI record being played back whose meaning was "call my callback function".


> Documented by true standard body

Is ECMA that "true"? I don't know. At least they document a JavaScript without "JavaScript" in its name. MS Office (2010 at least) isn't compatible with the standard's text. Is there even an MS Office nowadays that's better? Also, Office 2010, despite supporting it, is hostile to the OpenDocument format - opening a document gives you warnings about a legacy format and whatnot.

The OOXML standardisation process was very controversial and left a bad taste around MS and ECMA Switzerland: https://en.wikipedia.org/wiki/Standardization_of_Office_Open...

ECMA 376 is explicitly intended as documentation of what Office 2007 does. Michael Meeks of LO (then at Novell) worked on it and says it did this job: https://people.gnome.org/~michael/blog/2014-02-26-cabinet-of...

But in practice there's no such thing as "OOXML" - there's only what Office 2007, 2008, 2010, 2011, 2013 and 2016 happen to do. So if you want to work with the stuff, you verify everything. The standard is, like so much Microsoft documentation, best treated as based-on-a-true-story fiction with an unreliable narrator.

But the old formats are also (mostly) documented by Microsoft. Only once did I have to reverse-engineer a magic number, and only to get something working with Excel 97:


For the most part, with Word I get better compatibility in LO with the old binary formats than with OOXML. If someone sends me .docx and it looks garbled, it's easier to ask them to resave as .doc than to try to get them to produce .odt

If you have the time, you should consider filing a bug report.

I once suggested getting rid of the name and the ISO "standard" but keeping the file format itself in a poorly written blog article.

.docx is a much nicer format than .doc.

It is literally a zip file containing some XML files [0]. Having said that, the spec is still 106 pages, but the entire XML schema is only 8 pages; the rest is exposition.

[0] Some features (such as embedding images) probably involve including non-xml files.

[1] https://msdn.microsoft.com/en-us/library/dd773189(v=office.1...

You only linked to the "extensions" MS added on top of OOXML, whose specs are in total an order of magnitude bigger. ISO/IEC-29500 or ECMA-376 is the actual spec, and its first part (of 5) is 5000+ pages. In comparison, 349 pages for the main spec + a 9 page auxiliary spec for the document "filesystem" seems almost trivial.


Much nicer? Not really. It's literally a transpilation of the binary format. The XML files are very abnormal for everyone except those who are familiar with how the original older doc/xls/ppt file formats work. E.g. look at how the bold tag is implemented; no sane developer would implement it that way. It's clearly done to make it as straightforward a port of the old format as possible, and as difficult as possible for everyone else. The XML even contains binary blobs and all. A low-light. Every other modern office format is nicer.



Worse, it's designed in a way that works completely differently from every other XML office format to allow Microsoft to patent it. Doing document markup the normal way in XML has too much prior art to be patentable (though that didn't stop Microsoft from trying), but they managed to get a patent on their oddball way of doing things.

The PDF spec is 1000 pages. Nobody seems to complain. Or at least they complain a lot less.

Fortunately the only people who can read 1000-page documents are those who already have working PDF renderers, so the problem is self-limiting.

You can ignore about 800 pages of that for most applications though.

Not really; once you start to implement it, more and more functionality is required as you get more samples.

I wrote an open-source reader for docx, including for math equations, in a week or two of part-time hobby work. The spec isn't always sensible (though I'm sure there are historical reasons for most of the decisions) but it's always easy enough to look at the xml and figure it out. Most importantly, your language only requires a zip lib and an xml parsing lib. So, yeah, I'd say it's much nicer to deal with than what this article describes.
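The "zip lib plus xml lib" claim above is easy to demonstrate. A minimal sketch (not the commenter's actual reader) that pulls plain text out of a .docx using only the Python standard library, assuming only the documented OOXML layout in which the main content lives in `word/document.xml` and text runs are `<w:t>` elements:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used in word/document.xml (per ECMA-376)
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(path_or_file):
    """Extract plain text from a .docx: unzip one member, parse one XML
    file, and join the <w:t> text runs in document order."""
    with zipfile.ZipFile(path_or_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    return "".join(node.text or "" for node in root.iter(W + "t"))
```

This obviously ignores formatting, tables, math, and embedded objects, but it shows why the barrier to entry is so much lower than with the binary formats.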

My understanding is that the .docx format is basically just the .doc format, transcribed to XML, and with a .zip filesystem instead of an OLE Compound Document filesystem. In fact, I recall some Office-competitor developers saying they preferred .doc, because (like Joel says) they could just read it into RAM and use it, instead of having to deserialize XML.

In practice there are assorted binary or encoded lumps.

More fun: you know what happens if you save a .DOC in Office 2007/2010/2013? Any features past 2003 are saved as a lump of OOXML! http://vmiklos.hu/blog/doctok.html#comment-header

There are some parts of it, like VB macros, that are still stored in the old format before being put in the zip file.

It's nice to see one thats not just bashing the format as bad or insulting the developers.

You know the author was part of Excel team, don't you?

The article seems like an explanation, not a justification. Moreover, I totally agree with the conclusion: use OLE to extract information programmatically.

But the format is still terrible :-) The fact that there were reasons to do it that way at the time is not the same as saying the decisions were the right ones. Probably there were reasons to do <insert here any despicable historical or technical catastrophe>.

I didn't know that the author was on the Excel team, but it doesn't really change my feelings.

I'd say that they were the right thing to do at the time, but in the long-term they turned out to be less than ideal.

From Microsoft's perspective, the formats allowed them to be faster than the alternatives, seemed easy to maintain internally from the sound of it, and they were able to cut and run from the format when needed.

If they decided to go with a format that would be "good" for the next 30 years, would they have been as competitive back then?

As with many things, time has a way of changing our perspectives. No, Office formats would not have been competitive on earlier machines if they had used a lexer/parser. Office documents weren't really documents so much as memory snapshots. This is why early Office products performed so much better than their competitors. If they were written from scratch today, the formats would be very different, but then they wouldn't be as feature-rich as their predecessors.

One thing overlooked is that the transition to the new DOCX formats (and PPTX, XSLX, etc.) was that they were being written about the same time Windows was working on Longhorn. Longhorn featured WinFS. WinFS would open the XML document formats and save assets like embedded images as their own components in the filesystem and then would reassemble the documents as you moved them from a WinFS store back to a traditional filesystem. The XML formats were in a similar way designed to make that process fast, just as the binary images did for earlier versions of Windows. The XML document formats were designed to integrate with a filesystem which never shipped. Without WinFS, the formats weren't built for third party interoperability, but were designed to make it easy to decompose a document into block components which the Office apps otherwise handled like verbose but compressed binary formats. If WinFS had shipped, the OOXML formats might not seem as convoluted.

"the right ones"...what would the "rights ones" have been besides shipping functioning software for their users and making money for their shareholders?

Functioning and right are not the same thing. And right for the shareholders is a strange way of defining what's right. Strange for me anyway.

That's kind of my point. On this forum we tend to define "right" by a certain standard of engineering aesthetics. I say a certain standard of aesthetics because it definitely isn't the only one.

Other definitions of "right" could reasonably include "our software performed the intended functions expected by the people that paid for it."

One common definition of "right" in capitalist countries is the fiduciary responsibility to maximize profit. It's not one that I personally think works in every scenario. That version of "right" can often find itself in direct contradiction of definitions from moral, social, and legal areas.

My only point was that "right" means a very particular thing on HN. It means Unix-y...but not actually Unix...it's this Unix that inhabits someone's mind. Where X and sockets are also "files" and text is the only RPC system you would ever need. It's also functional programming fetishistic. LISP because of the great PG. Erlang because I heard of this one really awesome app with great up time...and have you heard about Plan 9, it's perfect if it weren't for {insert dark force of mediocrity} that kept it from being a winner.

Thanks for asking. I've been needing to say that to someone for a long time.

I think it is better than the older binary formats, being based on XML.

And are those as crazy after a few years or have things gotten better since then?

If anything, I'd say they're worse.


XML'ized everything (yuck!), and instead of "just" Microsoft having control over the format, it's now a whole design-by-committee bureaucratically bloated mess with other parties trying to stuff things in too.

XML is a perfectly reasonable format for documents.

Perhaps for documents which are intended to be manually edited in a text editor, which is hardly the use-case here. Otherwise it's just extremely inefficient because of all the parsing involved. You can't know when one element ends or another begins, or skip the parts you're not interested in, without reading every single byte and parsing it. Your XML parser has to handle extra whitespace (which will almost never be present), various quoting options, and arbitrary attributes/tags; all "decision points" within the code that add complexity. With a TLV format like the Excel binary one, you can seek through the records without knowing what they contain, just by reading the header and adding its length field to the current offset. Then when you get to the record you want (identified by a simple integer comparison), just do a length check and read into the appropriate structure. There's far less to do, far fewer decision points, and far less to go wrong.
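The record-skipping trick described above is easy to sketch. The following is an illustration, not production BIFF-parsing code: it assumes the documented BIFF-style layout of a 2-byte little-endian record id followed by a 2-byte length and then the payload.

```python
import struct

def iter_records(buf):
    """Walk a BIFF-style TLV stream: each record is a 2-byte little-endian
    id, a 2-byte length, then `length` bytes of payload. Skipping a record
    you don't care about is just offset arithmetic -- no parsing needed."""
    off = 0
    while off + 4 <= len(buf):
        rec_id, length = struct.unpack_from("<HH", buf, off)
        yield rec_id, buf[off + 4 : off + 4 + length]
        off += 4 + length
```

A consumer interested only in, say, cell-value records compares `rec_id` against the ids it knows and ignores everything else, which is exactly the "you don't have to care about fonts and printing info" point made elsewhere in this thread.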

Except, of course, if malicious files are supplied or they are corrupted somehow.

Regardless of XML-ness, you still have to check that the values are valid. With XML, you also have to check whether they're of the right type, because the format is so much more flexible.

You have to check if they are the right type if they are binary files as well.

It is far easier to repair and recover an xml file than a binary format.

Except you have to move the rest of the data when you edit the document, or you can fragment the data, but then the old data will still be in memory / the memory-mapped file / on disk.

Laziest way to output an excel file: Write an html table to a file and make the extension ".xls"

This has the minor niggle that it will throw a warning to the user ("The file format and extension of *.xls don't match...")

But it has the "advantage" over the csv approach that you can include formatting via html/css styles.
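For illustration, a minimal sketch of the trick being described (the function name and file layout are my own; this relies on older Excel versions being willing to render an HTML table despite the .xls extension, with the format-mismatch warning mentioned above):

```python
import html

def rows_to_fake_xls(rows, path):
    """Write rows as an HTML table with a .xls extension. Older Excel
    versions open it (after a warning), and inline CSS on the cells
    gives you formatting that plain CSV can't express."""
    body = "\n".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"<html><body><table>\n{body}\n</table></body></html>")
```

As the replies below note, security updates to recent Office versions have broken this in some environments, so treat it as a legacy hack rather than a supported interchange path.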

That's broken now in Office 2013 and Office 2016. Several of my users -at the same time- found that they could not open "excel" documents provided by several different vendors, all of which were actually HTML tables in files named ".xls". My guess is that Microsoft broke this with a software update, but I've yet to research the problem.

This is caused by a security update, details and workarounds on the link below


One of our legacy systems did that, but it doesn't work with Office 2012.

There is an XML format that worked with the pre-OOXML versions of Office and should continue to work with modern versions as well. Plus it's a drastically simpler file format than the complex zipped XML-cluster of files that describe XLSX documents.

As far as I know there is no Office 2012 (either 2010 or 2013) and the html table trick works on both.

The version numbers for Office on Mac are different to those found on Windows.

Ah ok, didn't think of Mac. thanks

> These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.

Yeah, well that was eight years ago. There's no way that, immediately below this item on Hacker News would be another item boasting that

> Cap’n Proto is INFINITY TIMES faster than Protocol Buffers. This benchmark is, of course, unfair. It is only measuring the time to encode and decode a message in memory. Cap’n Proto gets a perfect score because there is no encoding/decoding step. The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!

I'm sure that's why this article is listed here today.

This comment on Cap'n Proto came one hour before this article was re-submitted: https://news.ycombinator.com/item?id=12471541

The problem is compounded by users not knowing exactly what it is that they’re storing/transmitting/archiving, and not caring enough to push for something leaner and more accessible.

It is also sad to see people deleting things to “make space” within some quota, when a massive amount of storage is clearly caused by unnecessarily-bloated files. How do you convince them to abandon the only editors that are familiar to them? How do you make plain text the new default?

People seem to like cracking open Word, typing a couple paragraphs and “sending that” to everyone. They don’t realize the kitchen sink comes with it. In the old days, sending large file attachments to an entire organization could be a disaster. If the server wasn’t very smart then it encoded and COPIED some monstrous Word file to everybody’s Inbox. And copied it again for reply-all. Inboxes would run out of disk space because E-mail sizes were insane! It was really frustrating to see that the amount of useful content was so small compared to the footprint.

Are they really "so complicated", or is it just a large amount of options, many of which might actually be ignorable for the task you're doing, that contribute to such an impression of complexity?

They were designed to be fast on very old computers. For the early versions of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386 at 20 MHz had to be able to run Excel comfortably.

To me, this suggests the opposite --- a complicated format would be difficult to parse efficiently. I found this document:


...which shows that it's basically TLV, so for simple data extraction it doesn't seem so difficult after all. For example, if you're just after cell values, you don't have to care about fonts, printing, and other formatting info.

The XML for Word files is a pretty big mess, and it doesn't follow the spec in places. Headers, for example: there are three section headers that can be specified (default, odd, and first). If odd and first are not included, it should fall back to default (according to the spec [1]), but it doesn't.

The old format had some shortened names which made stuff really hard to follow for the uninitiated. [2]

1. http://www.ecma-international.org/publications/standards/Ecm... (1st edition, part 4 for section headers.)

2. https://msdn.microsoft.com/en-us/library/office/dd921346(v=o...

> Are they really "so complicated", or is it just a large amount of options, many of which might actually be ignorable for the task you're doing, that contribute to such an impression of complexity?

Apache's library called it HSSF, Horrible Spreadsheet Format. You make the call.


I have literally used it for YEARS and never knew what HSSF meant.

https://en.wikipedia.org/wiki/Apache_POI was a fun read

HPSF (Horrible Property Set Format) HSMF (Horrible Stupid Mail Format) DDF (Dreadful Drawing Format)

Amusing, but I also have the feeling that it's difficult for Java developers to appreciate the benefits of a binary format because the language doesn't make it easy to work with them. XML, on the other hand...

Part of Apache POI, IIRC.

It stood for:

"Poor Obfuscation Implementation".

Yes, they are really that complex. The Office file formats are incredibly complex. Lots of moving parts, many different ways to express the same thing, lots of legacy stuff, endless obscure low-level bit shuffling. The OLE container is a small filesystem. The Word format includes its own heap.

What makes them fast is that they were designed to not require parsing at all. They are essentially a memory dump and can be loaded with a simple mmap. But if you want to "understand" what that data means, then you have a long and complex road ahead of you.

"A lot of the complexities in these file formats reflect features that are old, complicated, unloved, and rarely used. They’re still in the file format for backwards compatibility, and because it doesn’t cost anything for Microsoft to leave the code around."

That's not really true. It is not free to leave old code around.

It's not free, but taking the code out isn't free either, and keeping it around makes it easier to sell upgrades to newer versions.

I agree. A more accurate explanation is that it would cost Microsoft money and/or reputation to take a feature away from people who take it for granted. As the de facto enterprise solution, they're stuck supporting their old mistakes.

How do MS Word and Excel file formats compare to the file formats of WordPerfect, AmiPro, WordStar, Lotus 123, Borland Quattro, etc?

They themselves suggest writing Lotus 1-2-3 files at the bottom of the page. Unfortunately Excel 2007 and later no longer supports it.

What I'm edging at is that for all of Joel's apologia for Microsoft Office formats, other comparable tools of the era were vastly more sensible.

Though I didn't find a doc for WordPerfect file format spec offhand.

"Sensible" cuts across a lot of axes. Those comparable tools, for the most part, didn't have the same sorts of OLE wizardry that are really pretty useful for normal, non-technical people. That comes at a cost of technical complexity.

By "sensible": it was possible to implement independent, reliable, read-and-write capable tools for these formats.

OLE was a tremendous boon to lock-in on the part of Microsoft. It was useful, yes, but hardly flexible. There are other tools which offer comparable capabilities without the lock-in elements.

Hal Varian (currently Google's chief economist) wrote the book on vendor lock-in, and how to both secure and avoid it (as a customer) in the late 1990s.

There weren't comparable tools in 1990. There still aren't, for a lot of things--something as simple as embedding a spreadsheet table in a document is still not really feasible except through similarly closed mechanisms.

Your attempt to define flexibility solely as "the use of non-Microsoft tools" while casting aside exactly what OLE does for non-technical users is pretty transparent.

There's little notional difference between embedding one software tool within another, and calling one from another.

There's a considerably simpler architectural structure for the latter.

You still need the full multi-application support available. We're doing that today with browsers (the universal document reader) and plug-ins. Which are generally being considered a Bad Idea, and functionality (e.g., PDF readers, video) now being natively supplied.

I'm defining flexibility as a lack of arbitrarily-imposed constraints. Which is what the text on lock-in I referred to discusses at length.

I'm well aware that simple and expedient solutions often end up being long-term untenable. This doesn't mean that they're not simple and expedient in the first place. Though that simplicity often comes from the capacity to impose a single standard across an internally consistent (at least on a point-in-time basis) architecture.

Information technology vendors have long exploited the matter of standards to self-serving benefit. Microsoft were not consistent in either supporting or opposing standards. They were consistent in applying standards policies to their own benefit. Promotion of the IBM-PC compatibility standard increased the platform for Microsoft OS and applications sales. Hindering standards such as Ethernet, Internet, HTML, office application formats, Silverlight, OLE, AD, Exchange connectors (POP, IMAP), etc., was also strategically pursued.

You're focusing on the software specifics rather than the strategy. Yes, the trees are lovely, but there's a forest you might care to observe.

Because when you dump memory contents to disk, unswizzle the pointers, and call that your file format, things stop being simple.

It could have been worse, it might have been mork.

Reading this makes all of the personal struggles I have had writing software seem so petty.

It was difficult getting past the words Windows Meta File, the cause of one of my many software headaches in graduate school. I never found a way to export vector graphics from Matlab (admittedly not a Microsoft product) in a way that they could be embedded in LibreOffice (or any other open Office clone) to produce PowerPoint-compatible documents. But I certainly tried.

If the requirements were just to produce a document I could have generated PDFs, but my department head wanted PPT slides. Apparently there is no other vector graphics option that is compatible with both Matlab and Office, and there are zero useful tools for editing WMF files, save for the pathetic options available as part of the Office suite.

It might take thousands of work years to recreate Office from scratch and with an identical timeline but I categorically reject the notion that a clone from scratch without that historical constraint would take that long.

Well it would take less time to make a clone, but I'd argue it might take longer to fully support the format in a program that was not designed for it, and is trying to use the format in different ways than it was designed.

Microsoft created the OOXML formats from scratch, which must have implemented almost all the same features.

You unfortunately cannot "just use csv" if you're dealing with modern tabular data. (non-ascii)

Here is an example of basic BOM ignoring and delimiter insanity demonstrated by SAP: https://wiki.scn.sap.com/wiki/display/ABAP/CSV+tests+of+enco...

Long story short, Excel is to tabular data as IE was to HTML/CSS/JavaScript. It's headed for a cliff once people start realizing how bad it is at doing basic tasks.

If you're targeting Excel and generate CSV, then use UTF-8 with signature and include the sep= line at the start. Joel mentioned it as an alternative to creating Excel files, i.e., creating a file specifically for Excel to read. If that's the goal you don't need to care for other CSV readers and simply make things nice in Excel.
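A minimal sketch of that advice (the function name is mine; it assumes Excel's documented behaviors of detecting UTF-8 via the byte-order-mark "signature" and honoring a leading `sep=` line to override locale-specific delimiters):

```python
import csv

def excel_friendly_csv(rows, path):
    """Write CSV specifically for Excel: UTF-8 with a BOM so Excel
    detects the encoding, plus a leading 'sep=,' line so Excel uses
    a comma delimiter regardless of the user's locale settings."""
    # 'utf-8-sig' emits the BOM once at the start of the file.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write("sep=,\r\n")
        csv.writer(f).writerows(rows)
```

Note the tradeoff the parent comment states: this file is tuned for Excel specifically, and other CSV readers will see the `sep=,` line as a stray data row.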

SAP probably cannot do that.

SAP can do that, but the idea is to have standard support.

And that means that any program reading or writing CSV should support all possible standard-compliant CSV files.

You might get a CSV file from someone who created it with German Excel and want to read it into French Excel (it won’t work).

With all of these comments that know so much better about everything related to software than Joel or Microsoft, no wonder so many of you have written successful office suite software that has a billion customers.

Oh wait...

Most people who read mikekchar's comments would now have a second viewpoint of the story which differs from Joel's version.

Also please do tell us exactly what you have created so we can also gauge precisely what you are qualified to talk about in the future using your own metric.

And finally, since your comments on this story also obviously add a lot of substantive value to the discussion, we are now even more fascinated to have heard your very compelling argument in favor of Joel's version.

Oh wait...

Thanks Mr thr0waway1239. I would love to answer your questions!

I've never built anything even remotely approaching the scale of MS Office...that being said, I try not to take the experiences I do have and use them to belittle the work of people in vastly different scenarios. To read the comments in this thread you would think the binary formats of the Office suite were proof positive that the people at MS were completely incompetent AND Machiavellian enough to make software that worked, but obfuscated its internals to throw off the most wily reverse engineers. It reminds me of the way that Republicans say that Obama is the most incompetent bumbler and the most tyrannical schemer playing 3 dimensional chess while continually tripping over his own shoelaces.

That was my point. Statistically, most people never do anything great, yet there are so many opinions about how to do great things...we know that most of them must be wrong. Just not OUR opinion, right? We KNOW the right way to do things...not like those assholes...they...blah blah blah.

I thought that Joel's article had some good points. It's good to hear some inside knowledge on things sometimes. His point about using libraries to extract data was pretty much the best idea going for a long time when dealing with those documents. I would say that if you have to deal with that stuff these days that the Microsoft Graph is the easiest way to do it. It's a REST api that you can use to create and modify different Office objects. That's just me thinking "how do I do this in the least amount of effort", though. I'm not saying that anybody else should be doing it.

The thing is that most of the anti-Microsoft sentiment here is just knee-jerk red team/blue team BS. It's more tribal than it is anything else. I just wish that some of my programming brethren were a little more willing to question some of their own assumptions.

So thanks for genuinely engaging me in an open dialog about honest conversation...

oh wait...

You can have a successful product, but that doesn't mean that the document format is necessarily good.

What is said here held true in 1995. Post-2000, interoperability was clearly achievable. Microsoft is simply scared that if they made a simple enough format, they would lose the hold they have on the office software market.

Computers are now fast and robust enough to handle I/O much better. If they were at all ethical, they would have created a simpler format and built on top of it instead of changing the entire spec with every iteration of their software.

Software does not evolve by retaining irrelevant code.

Microsoft has a document (KB257757) strongly recommending you not try to use Office server-side. They have some recommendations on what other things you could do instead.

The latest versions of Office support the Open XML formats (docx, xlsx etc.), which are basically zipped XML files. It's not mentioned in the article, but using the Open XML SDK is the supported way of working with Office docs https://msdn.microsoft.com/en-us/library/office/bb448854.asp...

A little bit off the topic: But does anyone know if Google releases the file format specifications for their Google Apps Office products?

As far as I know, you can't even download a file in the original format. You have to download the Word / OpenOffice version instead.

It reminds me of the format of fonts. That is, your font is actually a program, a program used to draw shapes. There are even subroutine calls in it, and they are widely used to reduce file size.


So let me get this straight...

Microsoft programmers were terrible programmers who were expert enough to break code for other people while still having it work well for their customers?

I don't understand why they have to be when w3c can get html right

Since when did the w3c get html right?
