
The Plain Person’s Guide to Plain Text Social Science - jboynyc
http://plain-text.co/
======
glaberficken
>"as opposed to binary file formats like .docx"

.docx is not a binary file format unlike the .doc files from the past.

.docx is a plain text xml based format that has an Open published standard.

Most people don't know this, but you can simply unzip the contents of a docx
or xlsx file and inspect its plain text contents.
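
To illustrate the unzip-and-inspect point, here is a minimal Python sketch using only the standard library. It assumes the main text lives in the archive member `word/document.xml` under the WordprocessingML namespace. `docx_paragraphs` and `make_toy_docx` are hypothetical helper names, and the toy archive is a stripped-down stand-in rather than a complete, valid .docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> run elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_toy_docx(text):
    """Build an in-memory stand-in for a .docx (demo only, not a full document)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


paragraphs = docx_paragraphs(make_toy_docx("hello"))
```

With a real file you would pass its path instead, e.g. `docx_paragraphs("thesis.docx")` (the filename is made up).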

Archival and long term accessibility of the document contents should not be
the deciding factor in opting out of .docx

~~~
jasode
_>.docx is a plain text xml based format_

I realize you're trying to elucidate but the context of that author's sentence
was " _text editors_ " therefore things like vi/emacs/TextPad/Notepad do not
understand a _binary format_ such as a zip file. (And docx _is_ a zip file as
you already noted.)

Yes, _inside_ the binary zip file is a text file called "document.xml" but
none of those plain text editors will parse a zip file to get to that xml
file. From the viewpoint of plain text editors, a docx file is an _opaque
binary format_.
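
A rough Python sketch of what a plain text editor is up against, assuming a toy archive with a single DEFLATE-compressed member (not a real .docx): the raw bytes start with the zip local-file-header signature, and the XML text is not present verbatim in the byte stream.

```python
import io
import zipfile

# Build a small zip in memory with one compressed XML-like member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", "<w:t>" + "plain text " * 50 + "</w:t>")

raw = buffer.getvalue()

# Every zip local file header begins with these four bytes.
signature = raw[:4]

# The repeated phrase was compressed, so it is not stored as readable bytes.
text_visible = b"plain text plain text" in raw
```

Only the member name survives as readable bytes in the headers; the content itself has to be decompressed first.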

~~~
glaberficken
Yeah, I see, but my line of thinking was more along the lines of: "if I have
this docx in 100 years' time, will I be able to decode its contents?"

I think the answer is probably yes (pending possible "premature" apocalyptic
scenarios =)

Regarding the zip compression, docx is a "Document Container File"
standardized under ISO/IEC, so provided we still have computers and the
standard documentation is still known, it should in theory be possible to
decompress (inflate) the file.

Inside you would find all the text content as xml, and as some have pointed
out here, binary blobs as well. Those would be mostly font files and image
files (or am I missing some others that would be important?). In my opinion
the fonts would not be critical to the content meaning.

The images would of course depend on what format they were encoded in. But
that's a whole other discussion!

------
djhn
I've been following most of Kieran Healy's workflow recommendations for a
while now.

I hope these practices get more and more popular.

------
stared
I find it ironic that the first line is "Get the guide as a single PDF file.".
:)

~~~
sooheon
Why? Part of the joy of plain text is that it's simple to convert your content
to other formats.

~~~
stared
I just found it humorous. I totally understand, and practice, the plain text
approach (LaTeX, RMarkdown; now: Jekyll, see:
[http://p.migdal.pl/2015/12/02/first-post.html](http://p.migdal.pl/2015/12/02/first-post.html)).

~~~
sooheon
Browsed around to your D3 workshop writeup, it's a pretty amazing resource.
Thanks :)

------
hollerith
I wonder whether, in an ideal world, there would be an alternative to git
without git's long learning curve, for use by scholars in quantitative
disciplines.

~~~
phren0logy
Mercurial is a lot more intuitive, fossil is underrated. Both are probably a
better choice for work like this.

------
amelius
Why go through all this trouble if you are a social science student? If I were
not into computing, I would probably just use Google Docs for all the drafts,
and for the final version, I would copy+paste into Word and give it a
finishing touch (proper formatting).

(That last step should not be necessary really, because that is the
publisher's task, but publishers seem to be getting away with being lazy).

~~~
ehudla
I summarize this to my students using two laws and two postulates.

Two laws:

(1) For long documents, especially if they require citations, bibliographies,
mathematical notation, etc. – Word is a poor choice.

(2) For documents that take a long time to write and that you want to survive
a long time – Word is a poor choice.

The problem is that the alternative is to learn a new way of doing things,
which may have a long learning curve. So don't kill your research by wasting
time learning new tools. But if you can spare some time to learn better tools,
you should.

To understand the alternative, you need to internalize the two postulates:

(1) You should use tools that focus on the semantic aspects of the text, not
its visual appearance. For example: tools that encourage you to say “this is
a chapter heading” are good, tools that encourage you to say “this is
Arial-14” are bad.

(2) You should save your work in standard file formats. For text, which is
most of what we produce, this means TEXT FILES, not Word documents
(.doc, .docx), which tend to break when Word switches versions.

~~~
amelius
> You should use tools that focus on the semantic aspects of the text, not
> its visual appearance. For example: tools that encourage you to say “this
> is a chapter heading” are good, tools that encourage you to say “this is
> Arial-14” are bad.

All modern text processing systems can do that, including Word and Google
Docs.

> that tend to break when they switch versions.

I don't see the problem. I have two other postulates. (1) While you are
writing research papers or even a thesis, you should not change the version of
your word processor. (2) While writing, you should focus on the content, not
on the formatting.

Follow these laws, and you will do just fine with any decent word processor.

~~~
ehudla
As the saying goes, good luck with that.

Indeed, with strict discipline you can use semantic styles in Word and other
tools. People typically don't, and the default behavior in many cases is
annoying or leads to problems (copying the style when you intend to copy just
the text, and so on). And fiddling with these things takes a lot of time.

As for software versions, moving operating systems, computer crashes and the
like -- things happen when you are working on a dissertation and any other
long term project, and you end up trying to rescue your files. Moreover, when
you are a scholar, you often find yourself needing stuff you prepared decades
earlier. In many cases the versions are not even easily available, if at all
(Word 2.0, anyone?)

There are tons of sites by academics describing these scenarios in more
detail. I submitted a few links to HN a few minutes ago.

------
ehudla
If you write in org and want to export through pandoc, there's org-pandoc
[https://github.com/robtillotson/org-pandoc](https://github.com/robtillotson/org-pandoc).

------
ehudla
A nice way to integrate this kind of workflow with Zotero is zotxt
[https://gitlab.com/egh/zotxt](https://gitlab.com/egh/zotxt).

