The Plain Person’s Guide to Plain Text Social Science (plain-text.co)
67 points by jboynyc on Mar 4, 2016 | 21 comments

>"as opposed to binary file formats like .docx"

.docx is not a binary file format, unlike the .doc files of the past.

.docx is a plain text xml based format that has an Open published standard.

Most people don't know this, but you can simply unzip the contents of a docx or xlsx file and inspect its plain text contents.
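To illustrate the unzip-and-inspect point, here is a minimal Python sketch using only the standard library. The sample archive is a hypothetical stand-in built in memory (a real .docx carries more parts, such as `[Content_Types].xml`), and the helper names `build_sample_docx`, `list_members`, and `extract_text` are made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_docx() -> bytes:
    """Build a tiny stand-in archive (not a complete, valid Word file)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}">'
        "<w:body><w:p><w:r><w:t>Hello, plain text.</w:t></w:r></w:p></w:body>"
        "</w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


def list_members(docx_bytes: bytes) -> list:
    """Return the names of the files stored inside the zip container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


def extract_text(docx_bytes: bytes) -> str:
    """Concatenate the contents of every <w:t> text run in document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return "".join(node.text or "" for node in root.iter(f"{{{W_NS}}}t"))


if __name__ == "__main__":
    data = build_sample_docx()
    print(list_members(data))
    print(extract_text(data))
```

For a file on disk, passing its path to `zipfile.ZipFile` instead of an in-memory buffer should work the same way.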

Archival and long-term accessibility of the document contents should not be the deciding factor in opting out of .docx.

>.docx is a plain text xml based format

I realize you're trying to elucidate, but the context of that author's sentence was "text editors". Things like vi/emacs/TextPad/Notepad do not understand a binary format such as a zip file. (And docx is a zip file, as you already noted.)

Yes, inside the binary zip file is a text file called "document.xml" but none of those plain text editors will parse a zip file to get to that xml file. From the viewpoint of plain text editors, a docx file is an opaque binary format.

Yeah, I see, but my line of thinking was more along the lines of: "if I have this docx in 100 years' time, will I be able to decode its contents?"

I think the answer is probably yes (barring possible "premature" apocalyptic scenarios =)

Regarding the zip compression, docx is a "Document Container File" standardized under ISO/IEC, so provided we still have computers and the standard documentation is still known, it should in theory be possible to decompress the DEFLATE-compressed file.

Inside you would find all the text content as xml, and as some have pointed out here, binary blobs as well. Those would be mostly font files and image files (or am I missing some others that would be important?). In my opinion the fonts would not be critical to the content meaning.

The images would of course depend on what format they were encoded in. But that's a whole other discussion!
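A rough way to see which parts of the container are readable text and which are binary blobs is to try decoding each member as UTF-8. This is only a sketch: the archive below is a hypothetical stand-in built in memory, the "image" is just the 8-byte PNG signature, and `classify_members` is a name invented for this example. A successful UTF-8 decode is a heuristic, not proof that a part is meaningful text.

```python
import io
import zipfile
from typing import Dict


def build_sample_archive() -> bytes:
    """Build a stand-in container with one XML part and one binary part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document>Some text</document>")
        # Only the PNG file signature, standing in for an embedded image.
        zf.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n")
    return buf.getvalue()


def classify_members(archive_bytes: bytes) -> Dict[str, str]:
    """Label each member 'text' if it decodes as UTF-8, otherwise 'binary'."""
    labels = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            try:
                zf.read(name).decode("utf-8")
                labels[name] = "text"
            except UnicodeDecodeError:
                labels[name] = "binary"
    return labels


if __name__ == "__main__":
    for name, kind in classify_members(build_sample_archive()).items():
        print(f"{kind:6}  {name}")
```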

The stock emacs that ships with osx is not smart enough to dig into a docx file, but the version from here: http://emacsformacosx.com is.

> .docx is a plain text xml based format that has an Open published standard.

With binary blobs in it. And nearly any document created in an MS product is unlikely to validate under its published schema.

I've been following most of Kieran Healy's workflow recommendations for a while now.

I hope these practices get more and more popular.

I find it ironic that the first line is "Get the guide as a single PDF file.". :)

Why? Part of the joy of plain text is that it's simple to convert your content to other formats.

I just found it humorous. I totally understand, and practice, the plain text approach (LaTeX, RMarkdown; now: Jekyll, see: http://p.migdal.pl/2015/12/02/first-post.html).

Browsed around to your D3 workshop writeup, it's a pretty amazing resource. Thanks :)

I wonder whether, in an ideal world, there would be an alternative to git that doesn't have git's long learning curve, for use by scholars in quantitative disciplines.

Mercurial is a lot more intuitive, fossil is underrated. Both are probably a better choice for work like this.

Why go through all this trouble if you are a social science student? If I were not into computing, I would probably just use Google Docs for all the drafts, and for the final version, I would copy+paste into Word and give it a finishing touch (proper formatting).

(That last step should not be necessary really, because that is the publisher's task, but publishers seem to be getting away with being lazy).

I summarize this to my students using two laws and two postulates.

Two laws:

(1) For long documents, especially if they require citations, bibliographies, mathematical notation, etc. – Word is a poor choice.

(2) For documents that take a long time to write and that you want to survive a long time – Word is a poor choice.

The problem is that the alternative is to learn a new way of doing things, which may have a long learning curve. So don't kill your research by wasting time learning new tools. But if you can spare some time to learn better tools, you should.

To understand the alternative, you need to internalize the two postulates:

(1) You should use tools that focus on the semantic aspects of the text, not its visual appearance. For example: tools that encourage you to say “this is a chapter heading” are good, tools that encourage you to say “this is Arial-14” are bad.

(2) You should save your work in standard file formats. For text, which is most of what we produce, this means TEXT FILES, not Word documents (.doc, .docx) that tend to break when they switch versions.

> You should use tools that focus on the semantic aspects of the text, not its visual appearance. For example: tools that encourage you to say “this is a chapter heading” are good, tools that encourage you to say “this is Arial-14” are bad.

All modern text processing systems can do that, including Word and Google Docs.

> that tend to break when they switch versions.

I don't see the problem. I have two other postulates. (1) While you are writing research papers or even a thesis, you should not change the version of your word processor. (2) While writing, you should focus on the content, not on the formatting.

Follow these laws, and you will do just fine with any decent word processor.

As the saying goes, good luck with that.

Indeed, with strict discipline you can use semantic styles in Word and other tools. People typically don't, and the default behavior in many cases is annoying or leads to problems (copying the style when you intend to copy just the text, and so on). And fiddling with these things takes a lot of time.

As for software versions, moving operating systems, computer crashes and the like -- things happen when you are working on a dissertation or any other long-term project, and you end up trying to rescue your files. Moreover, when you are a scholar, you often find yourself needing stuff you prepared decades earlier. In many cases the old versions are not even easily available, if at all (Word 2.0, anyone?).

There are tons of sites by academics detailing these scenarios with more details. I submitted a few links to HN a few minutes ago.

One advantage of plain text is that git can easily track the changes.
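A small sketch of why line-oriented plain text is easy to track: a line-based diff shows exactly which line changed. This uses Python's `difflib` as a stand-in, not git's own diff machinery, and the file name `draft.md` and the sample text are made up for the example.

```python
import difflib


def line_diff(old: str, new: str, name: str = "draft.md") -> str:
    """Return a unified diff between two versions of a plain-text file."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )


OLD = "# Introduction\nWord is a poor choice.\nUse text files.\n"
NEW = "# Introduction\nWord is a poor choice for long documents.\nUse text files.\n"

if __name__ == "__main__":
    print(line_diff(OLD, NEW), end="")
```

Only the edited sentence shows up as a removed line and an added line; the surrounding lines appear as unchanged context.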

Google Docs can do that too, with a few mouseclicks.

Google Docs is subject to an external service.

If you write in org and want to export through pandoc, there's org-pandoc: https://github.com/robtillotson/org-pandoc

A nice way to integrate this kind of workflow with Zotero is zotxt: https://gitlab.com/egh/zotxt
