

Microsoft Open XML embarrassment: spaces go missing between words - bensummers
http://www.itwriting.com/blog/3912-microsoft-open-xml-embarrassment-spaces-go-missing-between-words.html

======
ZeroGravitas
_"as I understand it a large part of the point of Open XML is to preserve
fidelity in archived documents"_

No, the point was to preserve a monopoly. The stuff about preserving fidelity
was just a smokescreen since anyone paying attention knows that Word has a
long history of changing formatting depending on which printer you have
plugged into a machine. If you want that kind of archival then use the PDF/A
standard.

~~~
brudgers
The main purpose of Open XML is to facilitate processing and creating
documents programmatically using standard tools to manipulate the DOM.
Considering the vastness of resources which exist across platforms for
manipulating XML (including those for translating to and from XML schemas),
this is hardly a proprietary move on Microsoft's part. If they wanted
proprietary, they would have copyrighted the format as Autodesk did with the
Revit file format.

~~~
RodgerTheGreat
I don't know if you've actually examined the OOXML format, but I spent about 6
months dealing with it in great detail at my old job, specifically Excel. For
simple things, conventional XML manipulation tools could potentially be
useful, but there are some serious issues.

For one thing, the files Office generates and consumes are significantly
different from the format as specified in the ECMA and ISO standards. The
schemas for ECMA-376 will specify elements as occurring in any order, while
Excel will _require_ those elements to appear in a fixed order. Tons of tiny
problems like this exist everywhere, and you only need one mistake to create a
malformed document. The machine-readable XML schemas provided by Microsoft
_will not work_ without massaging.
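
As a rough sketch of the kind of massaging this implies, the snippet below sorts an element's children into a fixed order before the document is written out. The ordering list is purely illustrative and is an assumption, not something taken from the spec or from Excel's actual behavior; `reorder_children` is a hypothetical helper, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Illustrative ordering only: an assumption, not taken from the spec or Excel.
ASSUMED_ORDER = ["sheetViews", "cols", "sheetData"]


def reorder_children(parent, order):
    """Rearrange parent's children so known tags follow the given order."""
    rank = {tag: i for i, tag in enumerate(order)}
    # sorted() is stable, so tags missing from `order` keep their
    # relative order and end up after all the known tags.
    children = sorted(parent, key=lambda child: rank.get(child.tag, len(order)))
    for child in list(parent):
        parent.remove(child)
    parent.extend(children)


worksheet = ET.fromstring("<worksheet><sheetData/><cols/><sheetViews/></worksheet>")
reorder_children(worksheet, ASSUMED_ORDER)
tags = [child.tag for child in worksheet]
print(tags)
```

A consumer that accepts any order would read both versions identically; the point is only that a writer may have to normalize order to satisfy a stricter consumer.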

The format also makes it very difficult to make simple in-place changes. A
good example of this is what you might consider the most basic information in
a spreadsheet: cells. Cell tags are contained in row tags which are contained
in a sheetData tag. Rows contain their row number as an attribute, while cells
contain their A1 "row/column" ID. This means if you remove or reposition a row
in a spreadsheet you must update every cell in every subsequent row.
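
To make the renumbering cost concrete, here is a minimal sketch using a simplified, namespace-free stand-in for `sheetData`. The sample data and the `delete_row` helper are assumptions for illustration. A real workbook would presumably also need formulas and other references adjusted, which this ignores.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a worksheet's sheetData element.
SHEET = """
<sheetData>
  <row r="1"><c r="A1"><v>10</v></c><c r="B1"><v>11</v></c></row>
  <row r="2"><c r="A2"><v>20</v></c></row>
  <row r="3"><c r="A3"><v>30</v></c><c r="C3"><v>32</v></c></row>
</sheetData>
"""

CELL_REF = re.compile(r"([A-Z]+)(\d+)")


def delete_row(sheet_data, row_number):
    """Remove one row, then renumber every later row and each of its cells."""
    for row in list(sheet_data.findall("row")):
        r = int(row.get("r"))
        if r == row_number:
            sheet_data.remove(row)
        elif r > row_number:
            row.set("r", str(r - 1))
            for cell in row.findall("c"):
                # Keep the column letters, replace the row part of the A1 ID.
                column, _ = CELL_REF.fullmatch(cell.get("r")).groups()
                cell.set("r", f"{column}{r - 1}")


sheet_data = ET.fromstring(SHEET)
delete_row(sheet_data, 2)
row_numbers = [row.get("r") for row in sheet_data.findall("row")]
refs = [cell.get("r") for cell in sheet_data.iter("c")]
print(row_numbers, refs)
```

Deleting a single row touches every row and cell after it, because the position is stored redundantly in both the row attribute and each cell ID.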

How about extracting a string value from a cell? The format makes it possible
to store inline strings directly in a given cell, but Excel itself never uses
this functionality. Instead, the cell contains an index into the
SharedStringTable, which is stored as a separate XML document. If you delete
or modify a cell, it might be referencing an SST entry that is no longer used.
The only way to know is to search the document globally, remembering that
dozens of different elements could potentially refer to a shared string. If
your goal is to avoid bloating documents with junk, you have to solve this
problem for a number of cases: style references, fonts and more.
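
A minimal sketch of that global search, assuming simplified namespace-free XML and considering only cells marked `t="s"` (the shared-string cell type). As noted above, other elements can refer to shared strings too, so this deliberately undercounts references. `unused_shared_strings` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the shared string table and one worksheet.
SHARED_STRINGS = """
<sst>
  <si><t>Name</t></si>
  <si><t>Total</t></si>
  <si><t>Orphaned</t></si>
</sst>
"""

SHEET = """
<sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
  </row>
  <row r="2">
    <c r="A2" t="s"><v>0</v></c>
    <c r="B2"><v>42</v></c>
  </row>
</sheetData>
"""


def unused_shared_strings(sst, sheets):
    """Return (index, text) for entries that no string cell points at."""
    used = set()
    for sheet in sheets:
        for cell in sheet.iter("c"):
            # Only cells typed "s" hold an index into the table;
            # B2 above holds the plain number 42, not index 42.
            if cell.get("t") == "s":
                used.add(int(cell.find("v").text))
    entries = sst.findall("si")
    return [(i, si.find("t").text) for i, si in enumerate(entries) if i not in used]


sst = ET.fromstring(SHARED_STRINGS)
sheet = ET.fromstring(SHEET)
orphans = unused_shared_strings(sst, [sheet])
print(orphans)
```

Even after finding an orphan, removing it shifts the index of every later entry, so each cell that points past it has to be rewritten as well.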

If you want to modify OOXML documents in a robust, thorough manner, you'll
deal with tons of issues like this.

~~~
lars
I've also done some work dealing with the Excel file format for a personal
project (.xls not .xlsx). I think it needs to be clarified (not that you're
implying this), that at least a lot of this mess isn't deliberate obfuscation
on Microsoft's part.

For instance the SharedStringTable is something that made a lot of sense when
documents had to fit on floppy disks. Excel is 26 years old, and when you evolve
a file format for that long while trying to maintain backwards compatibility,
you'll inevitably be stuck with a messy format.

~~~
VMG
I doubt that _backwards_ compatibility is a priority at MS Office development.

------
rbanffy
I wouldn't know how to write this bug...

It should serve as a testimony on how hard it is to implement Office Open XML.
Not even Microsoft can get it right.

~~~
edanm
This might be a funny quip, but I disagree. Implementing _anything_ is hard,
and weird bugs creep in. I've never seen any program without bugs. So it's
pretty disingenuous to take one bug in Word and say that it shows there's
something wrong with Open XML.

~~~
rbanffy
> Implementing anything is hard

Come on. Implementing something you invented should be easy. I would
understand if the [Open|Libre]Office folks got it wrong, but Microsoft? The
same company that basically discredited ISO (and badly damaged its function
afterwards) in order to standardize this monstrosity? To not even bother to
implement its botched bogus standard correctly is beyond insulting.

And yes, of all things wrong in MS Office Open XML, the bugs are the least
important.

~~~
kenjackson
_Come on. Implementing something you invented should be easy._

That's absurd. So you're saying there's no Apple bugs in QuickTime or Cocoa?
You're saying there's no bugs in Emacs that Stallman wrote? You're saying that
there's no bugs in Mathematica written by Wolfram? You're saying that there's
no bugs in Java produced by Sun? You're saying that Ken Thompson wrote no bugs
in Unix? That Stroustrup wrote no bugs in C++?

I've never seen a non-trivial program, standards-based or not, that is bug
free, period. Not one.

Heck, there's a 30 year old bug in _binary search_ that largely went unnoticed
-- even Donald Knuth missed the bug!
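
The comment does not say which bug is meant. Assuming it is the widely discussed midpoint overflow, where `(low + high) / 2` wraps around in fixed-width integer arithmetic, a sketch looks like this. Python integers do not overflow, so 32-bit wraparound is simulated with the hypothetical `to_int32` helper.

```python
def to_int32(n):
    """Wrap an integer the way signed 32-bit arithmetic would."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def naive_mid(low, high):
    # The sum can exceed 2**31 - 1 and wrap to a negative number.
    return to_int32(low + high) // 2


def safe_mid(low, high):
    # The difference never exceeds the range, so nothing wraps.
    return low + (high - low) // 2


low, high = 1_500_000_000, 2_000_000_000
bad = naive_mid(low, high)
good = safe_mid(low, high)
print(bad, good)
```

The naive midpoint only misbehaves on very large arrays, which is one plausible reason such a bug could go unnoticed for a long time.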

Bugs happen in trivial programs. Any non-trivial program will have bugs.

This is completely insincere. Unless you're willing to say the same thing
about ODF and virtually every other file format that exists, since I can find
bugs implementing just about all of them from their core proponent.

~~~
recoiledsnake
Get used to it. Regardless of the technical merits, it's cool to hate on MS
and blindly support Apple/Google on Slashdot, Reddit and even more so on HN.
I've seen people quit HN in disgust because of the arguments, comments and
moderation of Apple fans on here.

------
joelhaasnoot
Noticed this the minute I installed the Office 2010 beta, but vice versa: the
spaces from my Word 2007 document were gone. Made me go back to 2007: wasn't
going to rewrite my 20 page report.

~~~
dspillett
> Office 2010 beta

> <sad story about a bug>

> wasn't going to rewrite my 20 page report.

And the moral of this story: don't use software that is officially of "beta"
quality for important work. This is not specific to MS Office.

~~~
joelhaasnoot
You're absolutely right, but the best test there is is the real-life one. And
I was prepared for this so could go back. No animals were hurt, no dogs were
blamed for the homework, but it gave a great first impression of the
"improvement".

------
brudgers
Link to original CNET article:
<http://news.cnet.com/8301-1001_3-20034213-92.html>

Link to original thread on Microsoft forum:
<http://social.answers.microsoft.com/Forums/en-US/wordshare/thread/2764c5ac-4f7c-4a6d-9419-9e37bddf82d8>

------
smackfu
That is a pretty obscure test case. And I like how the "severe" impact this
had was that someone got a bad grade on a paper.

~~~
notahacker
Sending a file in the default format of the world's most popular word
processor to another party who opens it on a computer with different settings
and a slightly older version of the world's most popular word processor doesn't
sound especially obscure to me.

I'm assuming most people that fail to get invited to job interviews because
missing spaces make their resume/cv look careless to a prospective employer
opening the file in Word 2007 probably aren't going to be aware of the reasons
behind the decision...

~~~
smackfu
Well, it's not like the test case can say "different settings." The devil is
in the details.

------
tzs
This sounds like simply a bug in Word 2007 or Word 2010, not a problem with
the document format.

------
trezor
As far as I see this, he admits the problem at the end: Microsoft Word is a
Word Processing program, not a publishing program.

You are not guaranteed to have your layout preserved when printing. This can
be for a variety of reasons, but it could be simple stuff like having your
file in A4 format and your printer only having "Letter" paper or
something equally silly. In cases like these, Word is _forced_ to reformat.

If you need or rely on a 100% accurate re-representation of your content, Word
should not be your tool. Never. Use PDFs. Simple as that. If you are writing
normal text, however, it will probably never be a real problem that you will
even notice.

Now, my question to the author (should he check out HN): How on earth is this
related to OOXML? Where is the smoking gun saying this is a bug in the file
format? I honestly don't see it, and I don't see anyone else here questioning
this unbacked claim.

I honestly expected better from HN.

~~~
zzleeper
Check the article again. It's not about layout preserved when printing, it's
about actual spaces being deleted in the editor and then kept deleted when you
save the file.

~~~
recoiledsnake
Maybe I need to spell it out for you and those downvoting the OP. The headline
says 'Open XML' (implying a bug in the standard format) whereas the article is
talking about Office 2007 and Office 2010. See the difference?

~~~
rbanffy
It's the implementation of the standard in Office 2007 and 2010.

~~~
recoiledsnake
So a more appropriate title could be 'Microsoft's Office embarrassment'? Or
not?

~~~
rbanffy
I think MS Office Open XML and MS Office are very closely related. Nobody
would implement that standard if it weren't for Microsoft pushing it.

And the title implies that, apart from MS Office Open XML, Office is not
embarrassing. One may disagree with that implication, but the title is more
specific.

~~~
recoiledsnake
There is a big difference between a file format and the program that makes use
of it (even though they are related and even if the same entity developed both)
that you're failing to grasp in multiple comments.

~~~
rbanffy
If you read the title carefully, you may also interpret it as Microsoft's Open
XML embarrassment. Since I am unaware of other Microsoft products that
implement the standard, I see no problem with the title. It's embarrassing,
it's embarrassing to Microsoft and it's about MS Office Open XML.

