I hate working with authors who try to force line breaks into text unnaturally. I have heard many justifications for it over the years, but I find it hard to understand why anyone would do it, other than because their tools (soft word wrap in the editor, word-based diffing) are terrible.
I certainly would not send a file with 'semantic linefeeds' to anyone else. I use markdown quite a lot and LaTeX a little, so I'm happy with
source file -> formatter -> file for reading
I'm sure someone will come up with a regexp that can take a 'standard' text file and split the lines on full stops and commas to restore an edited readable text file to 'semantic linefeed' source, thus allowing round trip copy editing.
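A minimal sketch of such a regexp, in Python. This is an assumption-laden toy, not a standard: it treats any full stop or comma followed by whitespace as a clause boundary, so abbreviations like "e.g." would defeat it.

```python
import re

def to_semantic_linefeeds(paragraph):
    # Break after '.' or ',' followed by whitespace, keeping the punctuation.
    return re.sub(r'([.,])\s+', r'\1\n', paragraph.strip())

print(to_semantic_linefeeds(
    "First clause, second clause. A new sentence follows."))
# → First clause,
#   second clause.
#   A new sentence follows.
```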
Yes to the first, no to the second. The reason is that the act of recombining lines into paragraphs assumes that lines broken by single linefeeds need to be merged into a paragraph. But in text, a list of items, meant to be read as a list, is also broken by single linefeeds, and must not be turned into a paragraph.
One often sees posts here by beginners that include a list of items, but the rendered version assembles the list into a (typically unreadable) paragraph. More experienced hands know to break the list up with double linefeeds to defeat the "intelligent" reformatting algorithm.
The bottom line is that a recombining algorithm cannot distinguish a list of items from a paragraph of individual lines. The act of breaking text into lines loses information irretrievably.
Allow me a prediction: All these conventions that break text into individual lines and then try to reassemble them, i.e. this forum, the e-mail convention, and a thousand other examples, will eventually be abandoned in favor of leaving the text alone. This will happen when people realize they're throwing away information that cannot be recovered.
When I wrote Apple Writer in the late 1970s, the first change I made to common practice was to retain the paragraph structure people naturally used when entering text (even though the displayed text was broken into lines on word boundaries). At the time, this was a bigger departure than it is now, and it helped make my program successful. But if I had been told then that people would still be defending the practice of breaking text into individual lines 35 years later, I would have laughed out loud.
"But if I had been told then that people would still be defending the practice of breaking text into individual lines 35 years later, I would have laughed out loud."
Laughing is good for you! If I read you right, you invented the soft line wrap? Excellent!
I've played around with Bram Cohen's "patience" diff algorithm, but I can only remember a couple times when it produced a better diff.
I beg to differ: One paragraph per line, please. The natural lexical unit is not the sentence, but the paragraph. These three sentences belong together as a unit, and should be separated from other paragraphs by a double linefeed.
If someone wants to later break paragraphs into separate sentences for some reason, it's child's play, and that option is implicit in this formatting. But if someone wants to reassemble individual sentences into paragraphs, as anyone knows who has tried to reassemble lines into paragraphs (as with an e-mail in its delivered form), it's nearly impossible to get right.
Complete thoughts reside in paragraphs, groupings of sentences, not in the sentences. One paragraph per line, please.
A quote: "Do not use manually entered hard line breaks within paragraphs when editing articles."
A long list of justifications follows in the article.
I rarely edit Wikipedia so I don't know what its diff algorithm does. But I'm pretty sure it doesn't do any merging in any scenario, so that partly explains why one clause per line would not help there. That's quite different from Sphinx and TeX, which are usually stored in merge-capable version control. I think all of Wikipedia's justifications are specific to the case of editing in a text-box on the web and without diff/merge algorithms.
> if someone wants to reassemble individual sentences into paragraphs, as anyone knows who has tried to reassemble lines into paragraphs (as with an e-mail in its delivered form), it's nearly impossible to get right.
The post proposes a single line-break after each clause or sentence, and then a double line-break after each paragraph. TeX and Sphinx do the right thing in those cases.
No, diff and merge algorithms are symbol-oriented.
For reasons that are as historical as they are technical, most diff and merge programs choose to break documents up on a line level of granularity in order to produce the symbols that are passed into the algorithm. But that's a design decision, not a technical one.
A diff/merge program that operates at a word level of granularity should be just as capable of handling two words edited on the same line as a line-oriented diff program is of handling two lines edited in the same function.
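As a sketch of that design decision, Python's difflib can run the same sequence-matching machinery over word symbols instead of line symbols. The `[-…-]`/`{+…+}` markers here are just an illustrative output format, loosely borrowed from word-diff conventions:

```python
import difflib

def word_diff(a, b):
    # Tokenize on whitespace: words are the symbols fed to the matcher.
    aw, bw = a.split(), b.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, aw, bw).get_opcodes():
        if op == 'equal':
            out.extend(aw[i1:i2])
        else:
            if i1 != i2:
                out.append('[-' + ' '.join(aw[i1:i2]) + '-]')
            if j1 != j2:
                out.append('{+' + ' '.join(bw[j1:j2]) + '+}')
    return ' '.join(out)

print(word_diff("the quick brown fox", "the slow brown fox"))
# → the [-quick-] {+slow+} brown fox
```

Two words edited in the same "line" fall out naturally, because the line was never a symbol in the first place.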
> The post proposes a single line-break after each clause or sentence, and then a double line-break after each paragraph.
It's a lot easier to rewrite the typographical conventions a piece of software conforms to than it is to rewrite the typographical conventions that millions of humans grew up using. Teaching the diff/merge program to recognize that CRLF isn't the only text boundary out there would achieve the same effect* at much lower cost.
*I realize that abbreviations complicate it somewhat. I'd submit, though, that if a basic diff/merge program is being relied on so closely in a scenario that this actually causes any consequential problems, then the real error might be between Mr. Diff User's keyboard and chair. For normal diff usage, 'failed' symbol boundary determinations like that are fine, the same as how a traditional line-oriented diff program doesn't critically suffer from the way it would interpret me inserting a carriage return into a line of code.
Point taken, and as soon as someone puts word-by-word merging into svn or git, I'll change my opinion.
> I'd submit, though, that if a basic diff/merge program is being relied on too closely in a scenario where that actually causes any consequential problems
[EDIT removed some response -- maybe "that" referred to a narrower scenario than I thought and parent didn't intend any insult.]
Yes, bad although logical. It's a shame we can't have one optimal convention for all common text that appears naturally in paragraph form. Absent diff and similar programs, lexical units consisting of paragraphs are the obvious choice.
> The post proposes a single line-break after each clause or sentence ...
Yes, which is the convention used in e-mail and elsewhere. But it throws away formatting information that can't be recovered (see below for the reason). Wouldn't it be better to revise diff so that it presents a subset of a paragraph containing the difference text, instead of having to break up the source document just to make diff happy?
Obviously if the text diff processes consists of programming source files, this issue may not be important. But in the general case, text naturally consists of paragraphs, not sentences, and to change text to make diff happy puts the cart before the horse.
The problem with recombining broken text is never more obvious than the case of a paragraph, broken into lines that will need to be merged, followed by a list of items meant to appear as individual lines, that should not be merged. A merge algorithm cannot distinguish the two cases in a deterministic way.
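The ambiguity is easy to demonstrate with a deliberately naive recombiner, sketched here in Python (the merging rule is the one the parent describes, not any particular tool's):

```python
def recombine(text):
    # Treat blank lines as paragraph breaks and join everything else.
    paragraphs = text.split('\n\n')
    return '\n\n'.join(' '.join(p.split('\n')) for p in paragraphs)

broken = "This paragraph was\nbroken into lines.\n\napples\noranges\npears"
print(recombine(broken))
# The paragraph is correctly rejoined, but the list of fruits is
# wrongly flattened into "apples oranges pears" -- the information
# needed to tell the two cases apart is simply not in the text.
```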
Do you literally mean a list, like LaTeX itemize, description, or enumerate? Then the markup (whatever it is) should indicate that. In plain text that was going to remain in plain text form, I would put list items with double line breaks between them, and leading asterisks. This avoids any ambiguity. I don't know any other scenario where I would want the items to appear as individual lines, but not as paragraphs.
> Wouldn't it be better to revise diff so that it presents a subset of a paragraph containing the difference text
As I already said, some diff algorithms already do that, but as far as I know there are NO merge algorithms that do it. (BTW just in case of terminological confusion: when I say "merge algorithm" I'm referring to version control-style merging of edits; when you say "merge algorithm" you're referring to paragraph-merging, the process of putting multiple lines together into a single paragraph.)
Is it? I haven't seen LaTeX do it wrong yet. Of course, I use actual formatting code to specify format -- what I'd give to someone else is the rendered document, not the LaTeX source.
I use Emacs with auto-fill-mode and Meta-Q to fill paragraphs. This usually works out ok in my own files because usually a re-fill of existing text only affects a few lines.
When other people get involved, diffs and merges are ruined, as the article says.
But the article's solution sounds like a lot of work. Do I have to manually break lines that get longer than 80 characters (or whatever my limit is)? Am I supposed to turn soft word wrap on?
I think the right solution is to rebind Meta-Q in Emacs to some magic command that refuses to reflow any text which is reported as unmodified by the version control, but does reflow new/modified text according to the article's rules, and also imposes an 80-character limit.
Edit: [http://stackoverflow.com/questions/539984/how-do-i-get-emacs...] has a lot of solutions for getting Emacs to fill according to the article's suggestions.
In most situations, there should be some sort of semantic break in your sentences every 80 characters or less. (It may not be demarcated with a comma.) If there isn't, you may want to consider reworking those sentences for clarity.
As a result, my text is very ragged right. I don't even notice it because I'm concentrating on meaning, not form.
Any chance the author can post a copy of the documentation his father had saved for the Documenter's Work Bench?
So many UNIX utilities are line-based, paragraphs just complicate things. [line break]
Yet we still type in paragraphs.
You can take the above text and feed it through fmt (one of my favorite utilities) and you get an opening paragraph with two sentences, a single-line paragraph, and a final paragraph with two sentences. You can control the line length too. Want 40-column output for better readability? Easy, when using fmt. But that works best when each paragraph comes in as a single line.
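fmt itself is the tool for this at the shell; as a rough illustration of the same reflow (an approximation of its core behavior, not fmt's actual algorithm), here is a Python sketch using the standard textwrap module:

```python
import textwrap

def refill(text, width=40):
    # Paragraphs are runs of text separated by blank lines;
    # each is collapsed to one logical line and refilled to the width.
    paragraphs = text.split('\n\n')
    return '\n\n'.join(
        textwrap.fill(' '.join(p.split()), width) for p in paragraphs)

print(refill("A first sentence. A second sentence that runs on a bit longer."))
# No output line exceeds 40 columns.
```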
Have you ever run PDFs through pdftotext or pdftohtml and been frustrated by the formatting? Line breaks from hell.
If documents were distributed in the format the author describes, we could convert them into PDFs and other pretty-printed formats. But converting from these "paragraphed" formats into readable plain text can be a real nuisance.
'Ventilated Prose' was a term used by Buckminster Fuller. The blog author linked above is drafting rapidly, then adding line breaks after each sentence/clause as an editing aid. He mentions vi and the use of diff.
Found in the comments to the blog post linked above. Just parking these for the inevitable return of this topic.
And of course, the real solution to the problem of "fussing with the lines of each paragraph so that they all end near the right margin" is to use a text editor that soft-wraps lines. Yes, computers have recently become powerful enough to make that possible while editing the document! Amazing.
GitHub's web UI only does a line diff, which is not particularly helpful when you change a word or two in a six-sentence paragraph. It's possible to do a word diff locally, of course: `git diff --word-diff`. But that's not the general use case for a code host; right now most code is line-oriented and the UI suits it well. That is one of the reasons the article advocates formatting thoughts with newlines: good portability. These input texts are closer to code than prose, so why format them as if they were prose? Splitting thoughts into units digestible by your coding environment has the huge benefit of working with individual thoughts instead of whole paragraphs.