
I'm quite surprised that the markup incompatibility with Microsoft Office is still an issue. Are they actively sabotaging the Open Document Foundation, or is it just really difficult to achieve?



I don't think they need to actively sabotage anything, and I wouldn't want to dwell on that line of thinking. I think a large part of the open issues are in corner-case territory, or stem from users' freedom to go wild with their document layouts, creating a mess of nested frames in tables and so on. Let's roll up our sleeves and fix them anyhow :)

Here is our meta report for MSO formats: https://bugs.documentfoundation.org/showdependencytree.cgi?i...

"depends on 1394 open bugs" - this includes all the further meta reports.


The OOXML specification is over 5K pages long. Even Microsoft's own online products do not implement it entirely. One should not be surprised that outside parties may face challenges achieving 100% implementation and interoperability.


And even Word/Excel have discrepancies between the 2003, 2011, 2013 and 2016 versions, especially between 2011 and 2013. Mostly, overlapping layouts are rendered differently.


It's really difficult to achieve.

The article mentions EMF/EMF+. This format is basically a list of calls to GDI.h, and it's a bitch to map to other graphics stacks when you are not on Windows.

The specification of the format is public (kudos to Microsoft), which helps a lot. But there are a lot of corner cases, and the spec can be quite hard to understand at times. Correctly computing the origin is tricky, for example (VIEWPORTORGEX, WINDOWORGEX, etc.).
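
To give a rough idea, the logical-to-device mapping driven by those records works something like this (a simplified sketch of the GDI behaviour with both window and viewport extents, not code from any particular library; record names abbreviated):

    # Simplified sketch: how window/viewport origin and extent records combine
    # to map logical (record) coordinates to device coordinates.
    def logical_to_device(lx, ly, win_org, win_ext, view_org, view_ext):
        wox, woy = win_org    # from the WINDOWORGEX record
        wex, wey = win_ext    # from the WINDOWEXTEX record
        vox, voy = view_org   # from the VIEWPORTORGEX record
        vex, vey = view_ext   # from the VIEWPORTEXTEX record
        dx = (lx - wox) * vex / wex + vox
        dy = (ly - woy) * vey / wey + voy
        return dx, dy

    # Get any of the four pairs wrong (or apply them at the wrong time,
    # since they can change mid-file) and everything ends up offset or scaled.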

For some stuff, it's impossible to get it right.

One example that comes to mind is a text encoding bug I hit a year ago (I maintain an EMF to SVG conversion library). It took me a week to track it down. In most cases the strings are UTF-16. But in weird cases, when the ETO_GLYPH_INDEX flag is set, the "encoding" is directly the index of the glyph inside the selected font.

It's not the most portable way to handle text... If you don't have the exact same font on your computer, there is a good chance the text will not be displayed correctly.

And converting back to a well-known encoding is tricky: using the cmap tables of the TTF file, you have to build a reverse cmap, pray there is a 1-to-1 mapping between glyphs and Unicode code points, and convert back to UTF-8 or something.
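
Roughly, the recovery step looks like this (a quick sketch using Python and fontTools rather than the actual C code in the library; the font file name is just a placeholder):

    # Sketch: build a reverse cmap (glyph ID -> Unicode code point) so that
    # ETO_GLYPH_INDEX strings can be turned back into real text.
    from fontTools.ttLib import TTFont

    def build_reverse_cmap(font_path):
        font = TTFont(font_path)
        cmap = font["cmap"].getBestCmap()   # {code point: glyph name}
        reverse = {}
        for codepoint, glyph_name in cmap.items():
            gid = font.getGlyphID(glyph_name)
            # Several code points can map to the same glyph; keep the first
            # one and pray it is the right choice.
            reverse.setdefault(gid, codepoint)
        return reverse

    # rev = build_reverse_cmap("SomeFont.ttf")
    # text = "".join(chr(rev[g]) for g in glyph_indices if g in rev)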

Even Microsoft itself, in the Mac OS version, got it wrong.


Can you link to that library?


I found it in his submission history: https://github.com/kakwa/libemf2svg


thanks ^^

For information, the bits of code handling the "font index encoding" run from:

https://github.com/kakwa/libemf2svg/blob/master/src/lib/emf2...

to

https://github.com/kakwa/libemf2svg/blob/master/src/lib/emf2...


I think we might have a similar issue - thanks!


Realistically, LibreOffice is not really relevant competition for Microsoft. Google Docs is much more important, even though their capabilities are widely disparate.


I guess that depends on your filter bubble. I know many more people who use LibreOffice compared to Google Docs. Most people I know don't trust the cloud with their important documents.


I do not trust the cloud even with unimportant documents, especially since you can be locked out of your cloud account without reason. I only keep non-critical offsite backups on cloud drives.


A very long time ago, I used to work on document importing and exporting for Word Perfect, so I'm relatively familiar with the problem in general. I haven't looked at this stuff in almost 20 years, so I have no idea about specifics at this point :-)

What people generally don't understand is that the save file formats do not describe what the output will look like. Back in the old, old, old days you would write a word processor and to "save" the file, you would just write a binary image of your data structures. That was what the doc file format was originally.
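
To make that concrete, here's a toy illustration (Python for brevity, with a completely made-up record layout) of what "save = dump the in-memory structures" means:

    # Toy illustration: the on-disk bytes mirror the program's internal layout,
    # not any document semantics. The record layout here is invented.
    import struct

    PARA_FMT = "<HHI"  # hypothetical paragraph record: style id, flags, text length

    def save_paragraph(f, style_id, flags, text):
        data = text.encode("utf-16-le")
        f.write(struct.pack(PARA_FMT, style_id, flags, len(data)))
        f.write(data)

    # Reading this back requires knowing the exact struct layout, what the
    # flags mean, and how the original program interprets them -- in other
    # words, the application's logic, not just the bytes.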

Although people didn't tend to program this way back then, imagine a Rails server. Now imagine that you handed over your database (with tables that correspond to each model object). That's kind of analogous to a "save" file. In order to render the data, you need the business logic (which, in Rails, is usually housed in the controllers) and you need the rendering logic (which is usually housed in the views). Just having a description of the database tables doesn't really help me all that much in rendering the data, as you can see.

Now imagine that we take a different Rails app that happens to do the same kind of thing. Some people say, "Oh it would be swell if you could import the data from the first Rails app and render it in yours. You are basically doing the same app, so it should be easy". But it's obviously not -- I wrote my controllers and views completely differently than the other Rails app. My model objects contain completely different data and are structured completely differently. Even worse -- I don't even render things the same way. If I want to support this functionality, I basically need to completely rewrite my rails app to be exactly the same as the previous Rails app.

And this is essentially the problem. If you want to render Word documents the way Word does, you basically have to rewrite Word inside your word processor. If you also want to support your own format, then you have to maintain two word processors. If you want to support some other file formats? It just gets worse and worse.

So what you could do is completely rewrite the rendering engine so that it is more flexible. But the problem is that this is a massive undertaking and nobody will believe you if you tell them that it is necessary. Even worse, there is a surprising amount of variation in rendering. For example, how do you deal with page breaks in footnotes? Most people don't write footnotes that are longer than a page (if you look at my posting history, you might suspect that I'm one of those people, but I digress). However, it's very common in the legal field. There is a specific correct way to break pages in footnotes and Word historically has not done it correctly. They do it differently. This is actually what kept Corel/Word Perfect in business for a long time -- in order to print the document, you would have to use Word Perfect and since the file conversion was crap, you pretty much had to keep it as a Word Perfect file forever.

I will say that while I worked on Word Perfect, Microsoft was very helpful in explaining how their formatter worked. They even regularly sent us bug reports when our import filters had regressions, etc. I've never believed that they intentionally obfuscated the process. It's just a difficult problem.


Wow - some great experience; it would be great to have your help improving the Document Liberation wpd filters - we do a great job there, but no doubt improvements are always possible - particularly with better understanding =)


Is compatibility more of a problem with .doc or .docx? .doc shouldn't still be evolving, I'd imagine.


.doc is more stable, but it's also much more onerous to be fully compatible with. .docx was at least designed as a reasonably modern, rational document format with a specification. .doc, by contrast, evolved from generations of MS Word that, for efficiency, saved files by doing raw dumps of internal data structures as binary blobs, which got more and more complicated as things accreted over the years and versions. And there was no public documentation of the very complex binary format until 2008. Joel Spolsky has some background on that here: https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...
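
As an aside, the fact that .docx is a ZIP container of XML parts means you can at least peek inside one with standard tools (a minimal sketch; "example.docx" is a placeholder):

    # A .docx file is a ZIP archive; the main body is typically word/document.xml.
    import zipfile

    with zipfile.ZipFile("example.docx") as z:
        print(z.namelist())                 # e.g. word/document.xml, word/styles.xml
        xml = z.read("word/document.xml")   # the document body as XML bytes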


The JPEG file formats are nearly as bad, especially once you get into camera vendor extensions. It was common practice back then. Just take a bunch of C structs and dump them from memory to disk.


Has Microsoft completely solved compatibility between different (or even the same) versions of MS Office by now?


Nope, I ran into Word documents that Word could not render just last month. And I hardly use Word; I edit about 1-5 docs per month.


It took my enterprise almost a year to iron out the issues to move from 2010/2013 to ProPlus/2016.


::Are they actively sabotaging the Open Document Foundation, or is it just really difficult to achieve?

The former is a more likely scenario


I'm not surprised at all that someone on HN replies to a positive story about some open-source product with unrelated bug complaints, then implies the bugs are due to malice on the part of the developers.


My worst downvotes have been when I'm referring directly to the motivations or knowledge of the commenter. My second-worst downvotes are when I'm opining on something I know nothing about. My suggestion (to myself, and anyone else interested) is to avoid these two pitfalls.


I think he meant Microsoft...


d'oh! I'll leave my comment up as a warning to others to reread carefully before getting annoyed.



