
The Tyranny of the Diff - baha_man
http://michaelfeathers.typepad.com/michael_feathers_blog/2012/04/the-tyranny-of-the-diff.html
======
Chris_Newton
Isn't this a fundamental difficulty with any flat text-file representation for
code, though? Many programming languages don't lend themselves to easy
semantic analysis based on even a single file, never mind a combination of
files. We usually don't work with canonical textual representations of even
quite simple programming ideas such as the example in the blog post of
replacing a nested condition with a guard clause.
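
To make the guard-clause example concrete, here is a minimal sketch using Python's standard `difflib`. The `ship`, `order`, and `dispatch` names are hypothetical and not taken from the blog post; they only appear inside the strings being compared.

```python
import difflib

# Hypothetical "before" version: nested conditions.
nested = [
    "def ship(order):",
    "    if order is not None:",
    "        if order.paid:",
    "            dispatch(order)",
]

# Hypothetical "after" version: same behaviour, written with guard clauses.
guarded = [
    "def ship(order):",
    "    if order is None:",
    "        return",
    "    if not order.paid:",
    "        return",
    "    dispatch(order)",
]

# Drop the two file-header lines ("--- before" / "+++ after").
body = list(difflib.unified_diff(nested, guarded, "before", "after", lineterm=""))[2:]
removed = [line for line in body if line.startswith("-")]
added = [line for line in body if line.startswith("+")]

print("\n".join(body))
print(len(removed), "lines removed,", len(added), "lines added")
# → 3 lines removed, 5 lines added
```

Every line of the body shows up as changed, even though the behaviour is the same; the line diff has no way to say "nested condition replaced with guard clause".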

At our current level of expressiveness, the refactoring tools that make these
semantic changes mostly aren't a huge step up from brute force text editing,
and the mechanical changes we can automate are usually trivial compared to the
way a developer might conceive a required change in the behaviour of his code.
Trying to reverse the process to identify the semantic significance of changes
after the fact is far beyond us today.

In fact, it's hard to see how we could ever move beyond that level without
developing some new and much more semantically rich way to represent our
programs. At that point, the idea of diffing raw text code side-by-side might
seem like it comes from the dark ages anyway...

~~~
marshray
There have been a thousand graduate school and commercial projects to try to
develop program representations and languages that were superior to the flat
text file. Obviously, none of them have received wide usage.

What has succeeded for many is scriptable text editors, cscope-like tools,
autocomplete-like features, and refactoring support in editors. But as you
point out, these are better tools to work with program text.

I think text is safe for the foreseeable future. Partly because we have
thousands of years of experience with it and partly because it's a winning
combination of an extremely powerful representation format and a KISS solution.

~~~
stcredzero
_There have been a thousand graduate school and commercial projects to try to
develop program representations and languages that were superior to the flat
text file. Obviously, none of them have received wide usage._

Chunk format Smalltalk change logs are a text file, but they're combined with
a mechanism for treating code changes like db transaction logs, in a language
with very little syntax, with very fine grained codebase change (method level,
with generally small methods) built into the language and environment. It's
also as rock solid as traditional flat text files, and even more robust in
some regards.

Code syntax which requires all information relevant in a class to be defined
together interferes with a change log scheme. But such syntactic structures
are helpful if dealing with code in traditional flat files. In a way, it's
like an Evolutionary Stable Strategy. It's mostly a flat file world, so the
languages are mostly implemented with that in mind. This general situation
makes it almost impossible for anything else to develop its own ecosystem.

I think forward progress will be made by leveraging "code folding" schemes.
Eventually, this will amount to the same functionality people are aiming for
then they discuss non-text representations.

~~~
stcredzero
Arrgh: when, not then. Thanks again iPad!

------
zdw
Diff works best when things are single idea per line, and when control
structures don't get in the way.

One example - the use of the ternary (?:) operator as a replacement for an
if/else assignment statement can be quick and easy when programming. The problem
is that diffs with it can look like a total mess because a whole lot of things
are happening in that one line.
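
A minimal sketch of that effect, using Python's conditional expression as a stand-in for `?:` and the standard `difflib` module. The `fee` and `order` names are made up for illustration.

```python
import difflib

def changed_lines(old, new):
    """Return only the +/- lines of a unified diff, without the file headers."""
    diff = list(difflib.unified_diff(old, new, lineterm=""))[2:]
    return [line for line in diff if line.startswith(("+", "-"))]

# The same one-value change (5 becomes 7), written two ways.
one_liner_old = ["fee = 5 if order.total < 100 and not order.member else 0"]
one_liner_new = ["fee = 7 if order.total < 100 and not order.member else 0"]

spread_old = [
    "if order.total < 100 and not order.member:",
    "    fee = 5",
    "else:",
    "    fee = 0",
]
spread_new = [
    "if order.total < 100 and not order.member:",
    "    fee = 7",
    "else:",
    "    fee = 0",
]

for line in changed_lines(one_liner_old, one_liner_new):
    print(line)   # the whole condition is repeated on both lines
for line in changed_lines(spread_old, spread_new):
    print(line)   # only the assignment that changed
```

In the spread-out form the reviewer sees just `-    fee = 5` and `+    fee = 7`; in the one-line form they have to re-read the condition and the else value to find what moved.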

Similarly, certain languages where program flow or control structures make it
so that people are inclined to make many things happen in one line (inline
regex, Lisp or Scheme syntax languages) can diff in a confusing manner.

Diff is great when everyone is using a coding/whitespace standard, and things
tend to atomically happen on one line.

I'd encourage that when refactoring code that's not to the standard, you do
two passes - one to clean up the code to the standard, then another to make
the actual changes.

~~~
gbog
Yes, and commit after every smallest atomic refactoring.

~~~
stcredzero
I've used environments with quick and reliable Undo/Redo stacks. To be
completely compatible, one would have to treat code as data in a
"nondestructive" editing scheme and save off the actual refactoring steps,
which can then be committed automatically.

Or maybe, a "quick snapshot" facility could be developed to make it easier to
save off intermediate steps and commit the series of them automatically.

------
knowtheory
The point is fair enough, but to call this a tyranny seems a bit much.

If one believes (as, say, the Literate Code movement does) that code alone
isn't sufficient to convey an author's intent, why should diffs alone, which
mark changes in a codebase, be any different?

Comprehensible commit messages and comments explaining the purpose of a method
itself (and perhaps the mechanism around which a block of code functions if
necessary) are no vice and go a long way to mitigate or counteract any such
"tyranny".

------
dochtman
Split up your changes.

Seriously: in most cases it's possible to split up a 30-line patch into 5
separate patches that, applied in order, monotonically improve your code base
and are easier to review, in total, than one larger patch. Changesets are
cheap, and we should be optimizing them for easy review, so any eyeball we get
can see what's going on.

~~~
morsch
When I'm ready to commit, I usually have solved the problem in code. In order
to create five separate patches that illustrate the line of thinking, wouldn't
I have to go back in time and re-create the intermediary steps?

I suppose I could create patches as I am actively solving the problem, but at
that point my code may very well be a mess that I'd have to clean up each
time.

Of course all of that is moot if your problem naturally segments into several
patches, if only by virtue of being larger than a 30-line patch or involving
several mostly independent components.

~~~
cpeterso
If you've completed a big change and can't stage it in separate commits that
build upon each other ("telling a story" of the feature development), I
recommend at least splitting non-overlapping chunks that can stand
(compile/test) independently into their own commits.

git-cola is a good GUI to visually stage chunks into separate commits. I
haven't really used git-cola's other functionality, but I really like its
visual staging features.

<http://git-cola.github.com/screenshots.html>

~~~
drothlis
git-gui (part of the core git distribution) provides similar mousey-clicky
staging -- you can stage hunks or individual lines at a time.

------
wnoise
A lot of people have mentioned ugly diffs when lines are considered the basic
unit of granularity, often giving examples that are much cleaner if words are
considered fundamental (e.g. latex, or lists of files in makefiles). But there
are many tools that can handle word-based diffs, and most have options to
change what's considered a word.

git diff can use --word-diff(=color) and --word-diff-regex=...

There is also the venerable "wdiff" program.
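
For a rough idea of what a word-level diff does, here is a small sketch built on Python's `difflib.SequenceMatcher`. It is not how git or wdiff are implemented; it only borrows the `[-removed-]` / `{+added+}` notation, and it treats whitespace-separated tokens as words.

```python
import difflib

def word_diff(old: str, new: str) -> str:
    """Diff two lines word by word, marking removals and additions inline."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 != i2:  # words present only in the old line
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 != j2:  # words present only in the new line
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

print(word_diff("SRCS = main.c util.c io.c",
                "SRCS = main.c util.c net.c io.c"))
# → SRCS = main.c util.c {+net.c+} io.c
```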

I've only found one program that can do "word patches" though: "wiggle"
<http://freecode.com/projects/wiggle> . It works well, though I've found the
interface to be slightly confusing. But turning it into a git diff and merge
driver isn't that hard.

~~~
michaelfeathers
I just want diffs at method scope. Show me the methods that have been added,
changed, or deleted. Use line diffs for things external to methods.
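
A method-scope summary like this can be sketched for Python source with the standard `ast` module. This is only an illustration under simplifying assumptions: functions are keyed by bare name (so same-named methods in different classes would collide), and `OLD` / `NEW` are made-up inputs.

```python
import ast

def functions(source: str) -> dict[str, str]:
    """Map each function name to a dump of its AST (formatting and comments ignored)."""
    return {
        node.name: ast.dump(node)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

def method_diff(old: str, new: str) -> dict[str, list[str]]:
    """Report which functions were added, removed, or changed between two sources."""
    before, after = functions(old), functions(new)
    common = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in common if before[name] != after[name]),
    }

OLD = """
def x():
    return 1

def total(xs):
    return sum(xs)
"""

NEW = """
def a():
    return 1

def c():
    return a()

def total(xs):
    # now off by one on purpose
    return sum(xs) + 1
"""

print(method_diff(OLD, NEW))
# → {'added': ['a', 'c'], 'removed': ['x'], 'changed': ['total']}
```

Note that a rename shows up as one removal plus one addition; detecting it as a rename would need extra matching on the function bodies.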

------
bricestacey
If you use vim, try the fugitive plugin[1]. You can do :Gdiff to see a
side-by-side diff between what is staged and the working tree. It provides some
context, but if you need more you can unfold it.

[1] <https://github.com/tpope/vim-fugitive>

~~~
sophacles
More generically, many diff viewers exist, and certainly help understand what
is going on with a diff, in a side by side comparison. However, in a
multi-file refactoring, sometimes this can still be a bit tricky. Say for instance
you split one method into 3 smaller methods that are somewhat inter-dependent
(you have x(), but now it is a(), b(), c(), and you sometimes call a(); c()
and sometimes b(); c(), and sometimes just c()); in that case you have a
complex diff that is hard to make sure everything is correct in. Even if you
can turn it into a series of changes keeping x() as a wrapper around a(), b()
and c(), you'll frequently end up with several multi-file diffs to look at as
you migrate.

I somewhat agree with the author, that maybe there is some better change
viewing paradigm we aren't seeing for these complex cases, that could benefit
everyone.

------
pwpwp
One of the most interesting things re diffing/VC I've seen in a while is
"Towards Structural Version Control"
<https://www.cs.indiana.edu/~yw21/slides/ydiff-slides.pdf>

------
naner
_I realized that I didn't want to look at the diff any more. I just wanted to
see the full body of the affected method before and after my changes. Oh, and
I also wanted to see whether I added new tests or changed existing ones._

Well there are multiple ways to do that. This doesn't really have anything to
do with the diff but with how you choose to view it.

~~~
sliverstorm
Absolutely, I was just thinking that. Many version control systems support an
arbitrary diff command, e.g. via environment variables. In that case, one
might be able to try tkdiff, for example.

------
dmlorenzetti
I often stage changes with an eye toward making the diffs clearer -- not for
myself at commit-time, but for myself and colleagues in the future.

------
TazeTSchnitzel
That would be an interesting feature to add to diffs - if there are huge
structural changes in a block, just show before and after instead of trying to
show differences.

No idea how you would define the criteria or implement it though.

~~~
_delirium
I'd like an option to collapse additions or removals of entire functions into
just one line saying "foo() added" or "foo() removed", instead of the 50 lines
of the function's body scrolling past. However that does have to be language
specific.

------
makecheck
Most languages and file formats don't _enforce_ an inherently "undiffable"
layout, so it's really an issue of programming style. People tend to lazily do
what's easiest to _write_ instead of thinking about what will be easiest to
_read_ later (in "diff" or otherwise).

A common example I see is something like a one-line list of files to build in
a makefile. It is certainly _possible_ to put every file on its own line and
backslash-escape each line ending, and doing so produces a very readable
"diff": if someone adds a file you see "+ xyz.c" (or whatever) _and that's it_
instead of a mangled mess of file lists repeated with one word that's
different.
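
The makefile case is easy to reproduce with Python's `difflib`. The sketch below compares the two layouts when a hypothetical `net.c` is added; the file names are made up, and the new file is added in the middle of the list so the trailing-backslash question for the last entry does not come into play.

```python
import difflib

def changed_lines(old, new):
    """Return only the +/- lines of a unified diff, without the file headers."""
    diff = list(difflib.unified_diff(old, new, lineterm=""))[2:]
    return [line for line in diff if line.startswith(("+", "-"))]

# Layout 1: every source file on one line.
flat_old = ["SRCS = main.c util.c io.c"]
flat_new = ["SRCS = main.c util.c net.c io.c"]

# Layout 2: one file per line, with backslash continuations.
split_old = ["SRCS = \\", "    main.c \\", "    util.c \\", "    io.c"]
split_new = ["SRCS = \\", "    main.c \\", "    util.c \\", "    net.c \\", "    io.c"]

print(changed_lines(flat_old, flat_new))
# → ['-SRCS = main.c util.c io.c', '+SRCS = main.c util.c net.c io.c']
print(changed_lines(split_old, split_new))
# → ['+    net.c \\']
```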

------
ExpiredLink
Simple suggestion: Don't change the old code. Copy the old method and then
rename the old method. Refactor the copied code. Diff will only mark your
'newly added' code which will check in without conflicts. Later remove the old
method.

------
_cavalle
For the reasons described in this post I tend to separate in different commits
those changes that alter functionality from those that are refactorings.

Sometimes I refactor before making my changes, and sometimes I do it
afterwards. In any case I try not to mix a change in the functionality and
some refactoring in the same commit. That way, in retrospect, it's easier for
me to understand each commit: the ones related to changes in functionality
have simple, easy to understand diffs, and the ones related to refactorings
have messy diffs but at least I know that they don't change any functionality.

------
zvrba
I have the same problem when maintaining Latex files in some RCS. After any
"big" change (rewritten sentence, etc), I customarily reformat the text in
emacs so that it looks nice on screen, which also totally messes up the diff.

emacs has a mode which allows one "logical" line to wrap and to be edited as
many "physical" lines, but when I tested it a few years ago, it was rather
broken. Fortunately, for editing Latex and such, I don't really need the
diffs, I'm just interested in archival.

~~~
drunkpotato
Emacs visual-line-mode has gotten much better; check it out again if you're
interested.

What I do with my Latex files is have one sentence per (logical) line. I've
found diffs at the sentence level much more helpful.

------
pnathan
I like using a UI to do large diffs in a side-by-side fashion.

------
keypusher
Use meld, or any one of the many other excellent visual diff viewers.

------
slurgfest
The problem isn't that the diff has "tyranny". The problem is that what you
are diffing is bytes instead of the program's AST.
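
As a small illustration of the difference, here is a sketch that uses Python's own `ast` module as a stand-in for "the program's AST". The `area` function is a made-up example.

```python
import ast

def same_ast(a: str, b: str) -> bool:
    """True when two sources parse to the same tree, whatever their layout."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))

original = "def area(w, h):\n    return w * h\n"
# Same program: different spacing, an added comment, redundant parentheses.
reformatted = "def area(w,h):\n\n    # width times height\n    return (w*h)\n"
# Different program: the operator changed.
changed = "def area(w, h):\n    return w + h\n"

print(original == reformatted)          # → False (the bytes differ)
print(same_ast(original, reformatted))  # → True  (the trees do not)
print(same_ast(original, changed))      # → False (a real change)
```

A byte-level diff reports the first pair as a change to every line of the body, while a tree comparison reports nothing at all. The flip side is that comments are invisible to this comparison, so a real tool would have to decide how to treat them.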

------
renata
git add --patch can also help with this. You don't necessarily need to make
all your changes simultaneously.

