

Ask HN: Why is docx 3 times smaller in size than pdf? - ronzensci

I've noticed that a docx file is now almost 3 times smaller than the corresponding PDF saved from Word 2010. Is this something MS is doing deliberately to discourage people from saving as PDF?
======
Piskvorrr
1) depends on content - is this just text with some formatting? Does the PDF
embed fonts? Are there images? Etc.

2) docx format is wrapped in a zip file, so there is built-in default
compression; a PDF is not compressed as a whole (only individual streams
inside it may be, depending on the tool that wrote it)

3) does it matter? Most documents I see these days are <1 MB; assuming a
uniform size ratio of 3, the PDF would be under 3 MB - not very significant
in absolute numbers.

I'd say it's not a conspiracy, just a different format - and that the size
difference is not very significant. (Now, a 200 MB docx versus a 600 MB pdf
might be significant; but if you are routinely handling 200 MB documents, I'd
suspect Something Is Horribly Wrong anyway).
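
Point 2 can be checked directly, since Python's standard `zipfile` module can open a docx. The following is a minimal sketch, not a definitive tool: `zip_entry_sizes` and `make_fake_docx` are hypothetical helper names, and the archive built here is a tiny stand-in rather than a real Word file.

```python
import io
import zipfile


def zip_entry_sizes(data: bytes):
    """Return (name, uncompressed_size, compressed_size) for each ZIP entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in archive.infolist()
        ]


def make_fake_docx() -> bytes:
    """Build a tiny stand-in archive; a real .docx contains more parts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Repetitive XML markup, which is what makes document parts compress well.
        body = "<w:p><w:r><w:t>Hello, world.</w:t></w:r></w:p>" * 200
        archive.writestr("word/document.xml", "<w:document>" + body + "</w:document>")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, raw, packed in zip_entry_sizes(make_fake_docx()):
        print(f"{name}: {raw} bytes raw, {packed} bytes compressed")
```

To try it on a real file, pass the bytes of the docx (for example `open("paper.docx", "rb").read()`, where the file name is an assumption) to `zip_entry_sizes`.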

------
lutusp
This is actually relevant, though it may not seem so at first glance. The
ratio between well-compressed and not-compressed document types declines as
the quality of the writing improves.

Obviously one of these document types routinely uses compression. If the user
composes prose carefully, and avoids repetition, the contrast you cite may be
much less.

The gold standard for compressible text is political speeches -- they tend to
say almost nothing with as many words as possible. I've always thought part of
the decision process leading to voting for someone should include an attempt
to compress his speeches.
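
A minimal sketch of that measurement, assuming zlib as the compressor. The slogan and the seeded random "words" are made-up samples, and `compression_ratio` is a hypothetical helper name; real speeches would have to be loaded from text files.

```python
import random
import zlib


def compression_ratio(text: str) -> float:
    """Original size divided by zlib-compressed size; higher means more redundant."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))


# Hypothetical sample: one slogan repeated many times.
repetitive = "We will create jobs, jobs, and more jobs for hard-working families. " * 40

# Stand-in for text with little repetition: seeded random "words".
rng = random.Random(42)
letters = "abcdefghijklmnopqrstuvwxyz"
varied = " ".join(
    "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
    for _ in range(400)
)

if __name__ == "__main__":
    print(f"repetitive: {compression_ratio(repetitive):.1f}x")
    print(f"varied:     {compression_ratio(varied):.1f}x")
```

The random words are only a rough proxy for carefully written prose, so the numbers say something about redundancy, not about quality as such.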

~~~
ronzensci
hahaha.. Very interesting points, conveyed with humor but with a strong base
in CS fundamentals. I guess my original question was not very well framed. I
am writing a CS conference paper (on the ACM SIG template), and four pages in
docx come to 90 KB while the same document saved as PDF comes to 260 KB. Until
recently I was using a much older version of Word and only just upgraded to
Word 2010. My older PDFs were converted via a free utility, and this is the
first time I have used the Word 2010 PDF converter.

I loved your analogy with political speeches. I actually have a friend who
tracks politicians (on www.mumbaivotes.com) - I might actually share this idea
of yours of compressing political speeches to identify the uniqueness in what
is said in the speech ;-)

~~~
lutusp
Oh -- I forgot to ask an important question: what happens when you compress
the PDF? If you can compress the PDF so it more closely equals the word
document in size, the practical difference between them sort of evaporates, at
least as to storage size.
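
One way to run that experiment is sketched below, assuming gzip as the compressor. The helper `gzip_savings` is hypothetical, and the byte strings stand in for file contents. The second sample is included because data that has already been deflate-compressed (as the streams inside a PDF often are) usually shrinks very little on a second pass.

```python
import gzip
import random
import zlib


def gzip_savings(data: bytes) -> float:
    """Fraction of the original size removed by gzip (negative if it grows)."""
    return 1 - len(gzip.compress(data)) / len(data)


# Stand-in for uncompressed file contents: seeded text from a small vocabulary.
rng = random.Random(0)
words = ["the", "size", "of", "a", "document", "depends", "on", "content"]
uncompressed = " ".join(rng.choice(words) for _ in range(5000)).encode("ascii")

# Stand-in for contents that were already deflate-compressed once.
precompressed = zlib.compress(uncompressed)

if __name__ == "__main__":
    print(f"uncompressed input:  {gzip_savings(uncompressed):.1%} saved")
    print(f"precompressed input: {gzip_savings(precompressed):.1%} saved")
```

For a real comparison, pass `pathlib.Path("paper.pdf").read_bytes()` to `gzip_savings` (the file name is an assumption) and compare the resulting size with the docx.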

To me the advantage of a PDF over a Word document is that the former is now
an open standard. And it is universally regarded as the "correct" format for
technical writing.

> ... I might actually share this idea of yours of compressing political
> speeches to identify the uniqueness in what is said in the speech ;-)

The intriguing part is that it's an objective measure, in a field that's
called "political science" but isn't remotely scientific. I'd like to see a
thorough study comparing political speeches in this easily conducted, almost
automatic way.

Another, perhaps more difficult, study would analyze advertising copy for use
of words that have no objective meaning, words that appeal only to emotion. My
favorite advertising word is "zesty", a frequently-used word with no
information content whatever.

------
scholia
I don't think the idea behind offering you the PDF option is to "prevent" you
from using it....

In fact, it was Adobe that tried to prevent you from saving to PDF. It
threatened Microsoft with legal action (especially in Europe) for trying to
offer the "open" PDF format in Office 2007. This was widely reported at the
time, but see, for example:

[http://betanews.com/2006/06/02/adobe-to-sue-microsoft-for-
pd...](http://betanews.com/2006/06/02/adobe-to-sue-microsoft-for-pdf-feature/)

~~~
fredsanford
How else could Adobe keep milking people for Adobe Professional?

Cost quote from them in 2009/Early 2010: 1K to 2K per user depending on
features.

I passed.

------
gadders
Just a guess, but doesn't PDF have to embed font info, while DOCX doesn't?

