Or you can use a diff command that understands word diffs. git has '--word-diff'; svn unfortunately doesn't have such an option, but http://www.sable.mcgill.ca/~cpicke/swd/ provides a nice script (there are others). Don't know about mercurial.
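For anyone without a word-aware diff to hand, here is a minimal sketch of the idea in Python using the standard difflib module. The function name `word_diff` is made up for illustration, and the `[-old-]` / `{+new+}` markers only loosely imitate the style of git's word diff; this is not a reimplementation of it.

```python
import difflib


def word_diff(old: str, new: str) -> str:
    """Return a one-line diff with removed and added words marked inline."""
    a, b = old.split(), new.split()
    # autojunk=False: do not let difflib discard frequent words as "junk".
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(word_diff("svn unfortunately lacks such an option",
                    "svn sadly lacks such an option"))
```

Splitting on whitespace throws away the original line breaks, so this only suits comparing one paragraph at a time.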
I hate working with authors who try to force line breaks into text unnaturally. I have heard many justifications for it over the years, but I find it hard to understand why anyone would do it, other than because their tools (soft word wrap in the editor, word-based diffing) are terrible.
The original article did say "In the tutorial, I ask students whether or not the Sphinx text files in their project will be read by end-users."
I certainly would not send a file with 'semantic linefeeds' to anyone else. I use markdown quite a lot and LaTeX a little, so I'm happy with
source file -> formatter -> file for reading
The humble fmt (Linux) deals with a file with 'semantic line feeds' and produces reasonable paragraphs. I think I will try this out, as a lot of text editors have line-oriented tools. It might be that the result of a lot of rearranging that way will resemble a Burroughs cut-up, but I shall see.
I'm sure someone will come up with a regexp that can take a 'standard' text file and split the lines on full stops and commas to restore an edited readable text file to 'semantic linefeed' source, thus allowing round trip copy editing.
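A rough sketch of such a regexp in Python, assuming paragraphs are separated by blank lines and that every full stop or comma followed by whitespace is a valid break point. The name `to_semantic_linefeeds` is hypothetical, and abbreviations such as "e.g." will be split in the wrong place.

```python
import re


def to_semantic_linefeeds(text: str) -> str:
    """Break each blank-line-delimited paragraph after full stops and commas."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for paragraph in paragraphs:
        # Collapse existing line breaks and runs of spaces to single spaces.
        flat = " ".join(paragraph.split())
        # Replace the space that follows a full stop or comma with a newline.
        result.append(re.sub(r"(?<=[.,]) ", "\n", flat))
    return "\n\n".join(result)


if __name__ == "__main__":
    print(to_semantic_linefeeds("One, two. Three.\n\nFour."))
```

Note that it treats every block as flowing prose, so anything that was meant to stay on separate lines gets flattened first.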
> I'm sure someone will come up with a regexp that can take a 'standard' text file and split the lines on full stops and commas to restore an edited readable text file to 'semantic linefeed' source, thus allowing round trip copy editing.
Yes to the first, no to the second. The reason is that the act of recombining lines into paragraphs makes the assumption that lines broken by single linefeeds need to be merged into a paragraph. But in text, a list of items, meant to be read as a list, is also broken by single line feeds, and must not be turned into a paragraph.
One often sees posts here by beginners that include a list of items, but the rendered version assembles the list into a (typically unreadable) paragraph. More experienced hands know to break the list up with double linefeeds to defeat the "intelligent" reformatting algorithm.
The bottom line is that a recombining algorithm cannot distinguish a list of items from a paragraph of individual lines. The act of breaking text into lines loses information irretrievably.
Allow me a prediction: All these conventions that break text into individual lines and then try to reassemble them, i.e. this forum, the e-mail convention, and a thousand other examples, will eventually be abandoned in favor of leaving the text alone. This will happen when people realize they're throwing away information that cannot be recovered.
When I wrote Apple Writer in the late 1970s, the first change I made to common practice was to retain the paragraph structure people naturally used in entering text (even though the displayed text was broken into lines on word boundaries). At the time, this was a bigger departure than it is now, and it helped make my program successful. But if I had been told then that people would still be defending the practice of breaking text into individual lines 35 years later, I would have laughed out loud.
point taken for plain text files, but I use markdown so list items have asterisks at beginning of the line, so I have a pattern to distinguish list items from flowing text.
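To make that concrete, a small Python sketch. `naive_rejoin` merges everything and loses the list; `rejoin` uses the leading-marker pattern to keep list items on their own lines. Both names and the `LIST_ITEM` pattern are assumptions for illustration, and this is nowhere near a full markdown parser (wrapped continuation lines of a list item are not handled).

```python
import re

# Assumed list syntax: a line starting with *, + or -, or "1." style numbering,
# followed by whitespace.
LIST_ITEM = re.compile(r"^\s*([*+-]|\d+\.)\s+")


def naive_rejoin(text: str) -> str:
    """Merge every line into one paragraph; list structure is lost."""
    return " ".join(text.split())


def rejoin(text: str) -> str:
    """Merge prose lines into paragraphs but leave list items alone."""
    out, current = [], []

    def flush() -> None:
        if current:
            out.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        if not line.strip():          # blank line ends the paragraph
            flush()
            out.append("")
        elif LIST_ITEM.match(line):   # list item stays on its own line
            flush()
            out.append(line.strip())
        else:                         # ordinary prose line
            current.append(line.strip())
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    sample = "Bring these,\nwhen you come.\n* tea\n* milk\n\nThanks,\nall."
    print(naive_rejoin(sample))
    print("---")
    print(rejoin(sample))
```

Without the markers there is nothing for the second function to key on, which is the parent's point about plain text.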
"But if I had been told then that people would still be defending the practice of breaking text into individual lines 35 years later, I would have laughed out loud."
Laughing is good for you! If I read you right, you invented the soft line wrap? Excellent!
If the merging isn't also word-aware, then surely this is only half a solution. I don't think the svn script you linked is able to merge (I only read the Readme, no further investigation).
I find myself adapting my coding style in order to produce prettier diffs (for code reviewers). For example, I may insert extra vertical whitespace or place a new function's definition where it won't be melded with unrelated diff chunks.
I've played around with Bram Cohen's "patience" diff algorithm, but I can only remember a couple times when it produced a better diff.
"git add -p" is wonderful for this. In a nutshell, git has line-level granularity for committing, instead of file-level granularity (I know it's technically more complex than that). When you add a file for committing with -p, you interactively choose only the diff hunks you want and leave the others behind for a future commit. Plus you can split the hunks if they aren't fine-grained enough for you.
I beg to differ: One paragraph per line, please. The natural lexical unit is not the sentence, but the paragraph. These three sentences belong together as a unit, and should be separated from other paragraphs by a double linefeed.
If someone wants to later break paragraphs into separate sentences for some reason, it's child's play, and that option is implicit in this formatting. But if someone wants to reassemble individual sentences into paragraphs, as anyone knows who has tried to reassemble lines into paragraphs (as with an e-mail in its delivered form), it's nearly impossible to get right.
Complete thoughts reside in paragraphs, groupings of sentences, not in the sentences. One paragraph per line, please.
Diff and merge algorithms are line-oriented. If a whole paragraph is on one line, and I edit one word, then the diff consists of the whole paragraph, which is bad. Some tools are able to do word-by-word diffs, as mentioned below. But none are able to merge correctly in the scenario where two branches have each edited one word in the same line.
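A quick way to see the difference, sketched in Python with the standard difflib module standing in for a line-based diff tool (the helper name `changed_lines` is made up):

```python
import difflib


def changed_lines(old: str, new: str):
    """Return (removed, added) lines as a line-based diff would report them."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old_lines[i1:i2])
            added.extend(new_lines[j1:j2])
    return removed, added


if __name__ == "__main__":
    sentences = ["Diff tools compare lines.",
                 "Editing one word touches the line.",
                 "Short lines keep changes small."]
    edited = [sentences[0], sentences[1].replace("word", "token"), sentences[2]]

    # One paragraph per line: the whole paragraph is reported as changed.
    print(changed_lines(" ".join(sentences), " ".join(edited)))
    # One sentence per line: only the edited sentence is reported.
    print(changed_lines("\n".join(sentences), "\n".join(edited)))
```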
I rarely edit Wikipedia so I don't know what its diff algorithm does. But I'm pretty sure it doesn't do any merging in any scenario, so that partly explains why one clause per line would not help there. That's quite different from Sphinx and TeX, which are usually stored in merge-capable version control. I think all of Wikipedia's justifications are specific to the case of editing in a text-box on the web and without diff/merge algorithms.
> if someone wants to reassemble individual sentences into paragraphs, as anyone knows who has tried to reassemble lines into paragraphs (as with an e-mail in its delivered form), it's nearly impossible to get right.
The post proposes a single line-break after each clause or sentence, and then a double line-break after each paragraph. TeX and Sphinx do the right thing in those cases.
No, diff and merge algorithms are symbol-oriented.
For reasons that are as historical as they are technical, most diff and merge programs choose to break documents up on a line level of granularity in order to produce the symbols that are passed into the algorithm. But that's a design decision, not a technical one.
A diff/merge program that operates at a word level of granularity should be just as capable of handling two words edited on the same line as a line-oriented diff program is of handling two lines edited in the same function.
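As a sketch of what that could look like, here is a toy three-way merge in Python where the symbols are words rather than lines. It is an illustration under stated assumptions, not how git or svn merge: `merge_words` and `_edits` are hypothetical names, the diffs come from difflib, whitespace is normalised to single spaces, and any two different edits that touch the same base words are simply reported as a conflict.

```python
import difflib


def _edits(base_words, other_words):
    """List (start, end, replacement) edits that turn base into other."""
    matcher = difflib.SequenceMatcher(None, base_words, other_words, autojunk=False)
    return [(i1, i2, tuple(other_words[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def merge_words(base: str, ours: str, theirs: str) -> str:
    """Apply both sides' word-level edits to base; raise on overlap."""
    words = base.split()
    # A set drops edits that both sides made identically.
    edits = sorted(set(_edits(words, ours.split())) |
                   set(_edits(words, theirs.split())))
    out, pos, prev_span = [], 0, None
    for start, end, replacement in edits:
        if start < pos or (start, end) == prev_span:
            raise ValueError(f"conflicting edits near word {start}")
        out.extend(words[pos:start])   # untouched words before this edit
        out.extend(replacement)        # the edited words
        pos, prev_span = end, (start, end)
    out.extend(words[pos:])
    return " ".join(out)


if __name__ == "__main__":
    base = "the quick brown fox jumps over the lazy dog"
    ours = "the fast brown fox jumps over the lazy dog"
    theirs = "the quick brown fox jumps over the sleepy dog"
    print(merge_words(base, ours, theirs))
```

Two single-word edits on the same line merge cleanly here because, at word granularity, they no longer touch the same symbol.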
> The post proposes a single line-break after each clause or sentence, and then a double line-break after each paragraph.
It's a lot easier to rewrite the typographical conventions a piece of software conforms to than it is to rewrite the typographical conventions that millions of humans grew up using. Teaching the diff/merge program to recognize that CRLF isn't the only text boundary out there would achieve the same effect* at much lower cost.
*I realize that abbreviations complicate it somewhat. I'd submit, though, that if a basic diff/merge program is being relied on too closely in a scenario where that actually causes any consequential problems then the real error might be between Mr. Diff User's keyboard and chair. For normal diff usage 'failed' symbol boundary determinations like that are fine, the same as how a traditional line-oriented diff program doesn't critically suffer from the way it would interpret me inserting a carriage return into a line of code.
> No, diff and merge algorithms are symbol-oriented.
Point taken, and as soon as someone puts word-by-word merging into svn or git, I'll change my opinion.
> I'd submit, though, that if a basic diff/merge program is being relied on too closely in a scenario where that actually causes any consequential problems
[EDIT removed some response -- maybe "that" referred to a narrower scenario than I thought and parent didn't intend any insult.]
> If a whole paragraph is on one line, and I edit one word, then the diff consists of the whole paragraph, which is bad.
Yes, bad although logical. It's a shame we can't have one optimal convention for all common text that appears naturally in paragraph form. Absent diff and similar programs, paragraphs are the obvious choice of lexical unit.
> The post proposes a single line-break after each clause or sentence ...
Yes, which is the convention used in e-mail and elsewhere. But it throws away formatting information that can't be recovered (see below for the reason). Wouldn't it be better to revise diff so that it presents a subset of a paragraph containing the difference text, instead of having to break up the source document just to make diff happy?
Obviously, if the text that diff processes consists of programming source files, this issue may not be important. But in the general case, text naturally consists of paragraphs, not sentences, and to change text to make diff happy puts the cart before the horse.
The problem with recombining broken text is never more obvious than the case of a paragraph, broken into lines that will need to be merged, followed by a list of items meant to appear as individual lines, that should not be merged. A merge algorithm cannot distinguish the two cases in a deterministic way.
> a list of items meant to appear as individual lines, that should not be merged.
Do you literally mean a list, like LaTeX itemize, description, or enumerate? Then the markup (whatever it is) should indicate that. In plain text that was going to remain in plain text form, I would put list items with double line breaks between them, and leading asterisks. This avoids any ambiguity. I don't know any other scenario where I would want the items to appear as individual lines, but not as paragraphs.
> Wouldn't it be better to revise diff so that it presents a subset of a paragraph containing the difference text
As I already said, some diff algorithms already do that, but as far as I know there are NO merge algorithms that do it. (BTW just in case of terminological confusion: when I say "merge algorithm" I'm referring to version control-style merging of edits; when you say "merge algorithm" you're referring to paragraph-merging, the process of putting multiple lines together into a single paragraph.)
> But if someone wants to reassemble individual sentences into paragraphs, as anyone knows who has tried to reassemble lines into paragraphs (as with an e-mail in its delivered form), it's nearly impossible to get right.
Is it? I haven't seen LaTeX do it wrong yet. Of course, I use actual formatting code to specify format -- what I'd give to someone else is the rendered document, not the LaTeX source.
FreeBSD has a strict rule that each sentence should always begin a new line. The reason isn't so much to simplify editing (not relevant with modern editors) or to make diffs more compact (size hardly matters); rather, the biggest reason is to make "svn blame" work better.
This is a topic that has often bothered me when collaborating with people on writing LaTeX.
I use Emacs with auto-fill-mode and Meta-Q to fill paragraphs. This usually works out ok in my own files because usually a re-fill of existing text only affects a few lines.
When other people get involved, diffs and merges are ruined, as the article says.
But the article's solution sounds like a lot of work. Do I have to manually break lines that get longer than 80 characters (or whatever my limit is)? Am I supposed to turn soft word wrap on?
I think the right solution is to rebind Meta-Q in Emacs to some magic command that refuses to reflow any text which is reported as unmodified by the version control, but does reflow new/modified text according to the article's rules, and also imposes an 80-character limit.
I would just use soft word wrap - well, I always have soft wrap on, but especially in this case it seems simplest to me to let long sentences/clauses soft wrap and retain the mapping from line to semantic unit.
> Do I have to manually break lines that get longer than 80 characters (or whatever my limit is)? Am I supposed to turn soft word wrap on?
In most situations, there should be some sort of semantic break in your sentences every 80 characters or less. (It may not be demarcated with a comma.) If there isn't, you may want to consider reworking those sentences for clarity.
Yes. When writing LaTeX, I put line breaks in after phrases and clauses, as well as at every sentence end. I'm following exactly Kernighan's advice from old nroff documentation.
As a result, my text is very ragged right. I don't even notice it because I'm concentrating on meaning, not form.
Or you could just set your line wrap margin pretty low, e.g. 50 or 60 characters, and then when you edit don't re-wrap. Requires less thinking about where to break lines, while still keeping most changes localized to a line and thus cleaner diffs.
I love the PWB! [line break]
Sometimes the best stuff does not have the best marketing.
Any chance the author can post a copy of the documentation his father had saved for the Documenter's Work Bench?
So many UNIX utilities are line-based, paragraphs just complicate things. [line break]
Yet we still type in paragraphs.
You can take the above text and feed it through fmt (one of my favorite utilities) and you get an opening paragraph with two sentences, a single line paragraph, and a final paragraph with two sentences. You can control the line length too. Want 40-column output for better readability? Easy, when using fmt. But you need input that is single lines.
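If you don't have fmt around, Python's textwrap module gives a rough stand-in. This is only a sketch of the same idea, not a reproduction of fmt's behaviour (fmt has its own rules, for example around indentation); the name `refill` and the blank-line paragraph convention are assumptions.

```python
import re
import textwrap


def refill(text: str, width: int = 40) -> str:
    """Re-wrap each blank-line-delimited paragraph to the given width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [
        textwrap.fill(" ".join(p.split()), width=width,
                      break_long_words=False, break_on_hyphens=False)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    source = ("So many UNIX utilities are line-based,\n"
              "paragraphs just complicate things.\n\n"
              "Yet we still type in paragraphs.")
    print(refill(source, width=40))
```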
Have you ever run PDFs through pdftotext or pdftohtml and been frustrated by the formatting? Line breaks from hell.
If documents were distributed in the format the author describes, we could convert them into PDFs and other pretty-printing formats. But converting from these "paragraphed" formats into readable plain text can be a real nuisance.
'Ventilated Prose' was a term used by Buckminster Fuller. The blog author linked above is drafting rapidly, then adding line breaks after each sentence/clause as an editing aid. He mentions Vi and the use of dif
I've been using semantic line breaks for a while for LaTeX since, using Vim, this is the only editing style that is sane on default settings. But that shouldn't necessarily be true; what can I do to make Vim friendlier to work with files that are one-paragraph-per-line or similarly formatted? I'm not particularly concerned about version control, just editing.
I found the incredibly thin page unreadable (had to use Clearly). One sentence per line makes sense for the "source" that you edit, but use something like LaTeX or Markdown so that we don't have to read it in that form.
This clown doesn't even know what a sentence is. In his first example, he has 1/5 sentences per line. Then he changes it to 1/7 sentences per line, i.e. he moves away from his supposed "one sentence per line" target. Yes, in the text of the article he admits that maybe he was thinking about clauses, but I have a zero tolerance policy towards objectively wrong titles.
And of course, the real solution to the problem of "fussing with the lines of each paragraph so that they all end near the right margin" is to use a text editor that soft-wraps lines. Yes, computers have recently become powerful enough to make that possible while editing the document! Amazing.
The title is One sentence per line, but he clarifies to include clauses in the body of the article. This also has nothing to do with where the margin ends, but editing text in developer formats that is later transformed to end-user formats in a way that gels well with the tools of the Unix environment. Your comment is unnecessarily venomous and not representative of the contents of the article.
Have you noticed how GitHub helpfully shows which words have changed inside a line when looking at diffs? If we stop teaching people to bend over backwards to accommodate 70s technology, maybe we'll have more young hackers fixing our tools.
> Have you noticed how Github helpfully shows which words have changed inside a line when looking at diffs?
GitHub's web UI only does a line diff, which is not particularly helpful when you change a word or two in a six-sentence paragraph. It's possible to do a word diff locally, of course (`git diff --word-diff`), but that's not the general use case for a code host: right now most code is line-oriented, and the UI suits it well. That is one of the reasons the article advocates breaking thoughts onto separate lines: good portability. These input texts are closer to code than prose, so why format them as if they were prose? Splitting thoughts into units digestible by your coding environment has the huge benefit of letting you work with individual thoughts instead of whole paragraphs.
Diffs are only half the picture. Is git able to correctly merge two branches which each have single-word edits to a long-line sentence? I'm pretty sure it's not. I think the article is correctly arguing that we should continue to bend over backwards to accommodate 2012 technology.