Bram Cohen: I have a question for the version control experts (facebook.com)
54 points by lispython 1637 days ago | 48 comments

When I left for my friend's for a couple of beers, I quickly glanced at his StackOverflow question. It was a copy-paste of his Facebook post. I thought "Mmm, this is probably getting closed". Turns out I was right :)

Through the magic of the voting system it has now been reopened.

and through the magic of StackExchange, it is now reclosed.

I wonder if I will be able to read this post in 3 years. I'm sure Facebook will have deleted it by then, and the link will probably be dead within 6 months anyway (first break it, maybe fix it later).

Is archive.org even allowed (by Facebook's robots.txt) to archive this?

It's sad when technical people post important information on Facebook.

Edit: Facebook's robots.txt has this: User-agent: *, Disallow: /. :-(

Now that Facebook uses clearly designed URLs (/bram.cohen/posts/10152387480820183 in this case), I have a lot more faith in them keeping this kind of thing working. It's extremely valuable content for them, especially with their recent emphasis on the timeline and capturing the story of people's entire lives.

One possibility is that Facebook stops existing. Now, I won't speculate about what the probability is or isn't, but nonetheless, it's a possibility.

On the other hand, there are also possibilities such as Facebook losing or deleting the content, or Bram Cohen deleting the content or his account.

If you try to manually crawl the post with Internet Archive's "liveweb" project (which inserts stuff into Wayback Machine for live lookups) you'll get:

"Page cannot be crawled or displayed due to robots.txt."

So no, the Internet Archive will most likely not archive public Facebook posts, because of Facebook's robots.txt. The IA respects and honours each domain's robots.txt.
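As a sanity check, Python's standard-library robots.txt parser agrees: fed the blanket rule quoted above, it refuses every path for every user agent. (The path below is just this thread's post URL; ia_archiver is the Wayback Machine's crawler name.)

```python
from urllib.robotparser import RobotFileParser

# Feed the parser the rules quoted above instead of fetching them
# over the network.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# The blanket Disallow applies to ia_archiver like any other agent.
print(rp.can_fetch("ia_archiver", "/bram.cohen/posts/10152387480820183"))
# → False
```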

I've taken a snapshot of the post with my own script, which will put it into an IA Wayback Machine-friendly format (WARC).

The version control in Squeak, based on Smalltalk-80, has been in use for over 16 years, probably even 32. The Squeak community has been entirely happy with it, though it could perhaps be made even more convenient with more diff analysis tools. For every snippet, even a single line in a method, all the versions are kept around, and the blame for revisions to a previous version, including deletions, is always clear. The last version is active (there is no compile phase in Smalltalk, only auto-compile for adaptive compilation). So your question is never relevant unless one wants to merge back after a fork. In that case diff tools should give a human the entire history, diffs and all the affected dependencies, to see what happened when and how. It is indeed useful to see changes going back 40 years. In cases where auto-merge-back is possible, the SUnit test runner must check whether it worked, but such a situation occurs only once in 10 years, if at all. So to me there is not much more to be desired in a version control system than the current method version control of Squeak.

Beyond the question, and reading entirely too much into it: what's the inventor of BitTorrent, of all things, doing writing a version control system? Between this and BitTorrent Live, are we going to see something that will blow every model of content distribution, including BitTorrent, out of the water?

If nothing else, torrents with version control would frankly be mind-blowing. Think of a single TV show torrent which could be updated again and again. Then think of TV broadcasting: the program distributed through such a torrent plus an overlay, with ads played in on the client side.

Thinking away the rights issues and DRM, this might be the future of live TV and content distribution.

Bram Cohen did VCS before Linus and Matt Mackall did it:



You are aware that the same man who created Winamp also created Gnutella, right? Sometimes geeks have different projects, and sometimes some of those projects are more subversive than others.

I would imagine it's a totally separate project.

Blame is pointless, and needed only for pathological organizations. The point of these tools is not to assign blame but to find the last working revision of the code. If a commit inherits from multiple commits then there could be multiple working last revisions.

Think of it like sailing: when the mast breaks, you don't care who last touched the mast, you care about how to get the mast back up.

In truth, the totality of world events leading up to the moment the mast broke is what led it to break. But that is not important; what is important is how to right it.

If the question is viewed in terms of righting the ship then the point is to identify the last revision that worked, conversely the 'broken' revision is the next one.

Realistically, to solve the problem you'd want a genetic algorithm that could turn on and off each change in each revision to find the combination that yields the most fit iteration of the code.

Blame is most useful after I've figured out what the bug is, and I want to understand why the change was originally made, and to see the other changes that were made to the source tree at the same time. So it's not the author of the commit which I'm interested in blaming, but to see what the reasoning behind the change originally was, by seeing the entire commit which included a particular change to a line.

As for the Stack Exchange comment that it's useless because what you usually see is whitespace fixups: (a) that's why I fix up the whitespace before I commit the final patch into the source tree, and/or insist that those problems be fixed before I pull a commit, and (b) this is why I strongly discourage whitespace-only cleanup patches. (Or checkpatch.pl style changes, in general.) With a little discipline, "git blame" can be incredibly useful.
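For the whitespace case specifically, git blame can also simply be told to ignore whitespace when attributing lines (the file path here is a placeholder):

```shell
# -w ignores whitespace when comparing a revision to its parent,
# so reindent-only commits are skipped over when attributing a line.
git blame -w path/to/file.c
```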

[edit] Whoops, meant to post this one level up.

Blame information is useful, not for pointing fingers, but for reasoning about changes. Often, I'll be browsing code, see something change, blame that line, then go and read that entire revision so I can find out why the line was changed. Normally for me, my goal is to find out the new argument ordering or whatever so I can continue writing my code using the new pattern.

Wait, what? My most common use case for blame is finding out who to ask about some code that I don't fully understand. Often a quick "git blame" and a short conversation can save a lot of time.

Try that on a code base that's 10 years old, where the one who wrote the code, like the rest of the core team, has probably changed jobs several times already.

Yes, it doesn't always work and is not always feasible.

But there are plenty of occasions where it is, because the changes broke the system. A broken system is not going to sit around for 10 years.

Pointless? You seem too caught up in the meaning of the word 'blame'. Maybe you should think of it as 'annotate', instead.

There are plenty of times where I want to find the changeset that last touched a line of code so I can read its commit message, find an associated bug tracker ticket, etc. What do these uses have to do with the last working revision of a project?


As others have said, it is often very important (at least when working in a collaborative manner on real-world complex code bases) to work out who added code and more importantly why it either exists or was written that way. More likely than not (assuming it's a fairly high-quality code base) there's an edge case somewhere that it's dealing with.

Bisecting to the revision before it broke doesn't tell you anything about this.

Ideally, there should be loads of comments too, but then, that doesn't happen nearly as often as it should in the real world...

I use git blame to find out who to talk to before doing a refactor. Often when I see something clumsy, there's a good reason for it, and it is nice to know who to ask.

Your comment reminded me of Chesterton's Fence http://epicureandealmaker.blogspot.com/2012/03/chesterton-fe... It has been written about a few hundred times in various blogs. Basically, if you don't see a reason, before you make changes think harder until you do.

git config alias.praise blame

Bazaar actually has blame, praise and annotate, all synonyms!

So does Subversion :-)

I like it! Doing it this way will mess up the blame for non-unique lines like

    else if {
but I don't think that matters because you are unlikely to care where such lines come from. (You could fix this anyway by doing unique lines first, then using the blame of those to make better choices for the non-unique lines.)
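A toy sketch of that content-based approach (my own illustration, not Bram's actual algorithm): attribute each line of the newest revision to the earliest revision in which that exact text appeared. Non-unique lines get whichever earlier occurrence happened to be recorded first, which is exactly the ambiguity described above.

```python
def naive_blame(revisions):
    """revisions: list of (rev_id, list_of_lines), oldest first.

    Returns (rev_id, line) pairs for the newest revision, where
    rev_id is the earliest revision containing that exact line.
    """
    first_seen = {}
    for rev_id, lines in revisions:
        for line in lines:
            # Only record the first revision that contained this text.
            first_seen.setdefault(line, rev_id)
    latest_lines = revisions[-1][1]
    return [(first_seen[line], line) for line in latest_lines]

history = [
    (1, ["def f():", "    return 1"]),
    (2, ["def f():", "    x = 2", "    return x"]),
]
print(naive_blame(history))
# [(1, 'def f():'), (2, '    x = 2'), (2, '    return x')]
```

The "unique lines first" refinement mentioned above would then use these attributions as anchors to disambiguate repeated lines between them.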

See also the more complicated variant at the end of this post from the git mailing list:


Blame is always messed up anyway, because there is no right answer. You may as well choose what is quick to compute and looks okay.

Version control systems are not storing what actually changed between files, they are storing the smallest set of differences between them.

I.e., given two versions of the same file and the history graph A->B, they store how to reproduce the bits of B from the bits of A.

This is completely unrelated to how B actually got that way. So in turn, they use textual diff algorithms to approximate how B was formed from A.

Even moving back into the text world, there is still no right answer. It only tells you one of the possible ways that A was transformed into B. It would be perfectly valid for the text diff algorithm to say "every line in A was removed, every line in B was added". This in turn would give you a blame that pointed to that rev for everything.

Most textual diff algorithms "try" to do something sensible, but blame is essentially trying to turn applesauce back into apples.
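Python's difflib (used here purely as an illustration, not as what any VCS does internally) makes the point concrete: many edit scripts turn A into B, and the tool just picks one of them.

```python
import difflib

# Two revisions of a file. difflib chooses to keep "keep" and
# replace "old" with "new", but "delete both lines of A, add both
# lines of B" would be an equally valid edit script.
a = ["keep\n", "old\n"]
b = ["keep\n", "new\n"]
for line in difflib.unified_diff(a, b, fromfile="A", tofile="B"):
    print(line, end="")
```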

Even git's more "advanced" blame can be completely messed up by the internal text diff doing dumb things.

All that said, one of the reasons you may like it is because it's basically what everyone actually does.

I wonder if, someday, it might be practical to integrate the "micro history" of a change into the overall version control system's understanding of the "macro history". I'm thinking of something like an intelligently-condensed version of the kind of change-replay you see by rolling forward or back through an editor's undo buffer. A system like git would still work the same way, but with this new extra blob of detailed metadata associated with the specific commit.

Fractal Design Painter (now Corel Painter), back in the day, used to have a feature where the entire creation and edit history of a "painting" could be captured and replayed in full detail, right down to the level of the brush angle and pressure used by the artist with every stroke. This was useful for creating art at low resolution, then re-creating the piece automatically (and much more slowly) at high resolution. Corel's version may still have this feature; I haven't used it in years. But if it was practical ten years ago in manipulating multi-hundred-megabyte files, certainly it would be possible now, for what are usually plain text files.

Instead of just doing a diff or git bisect, imagine being able to load up the commit of a file with one of those ambiguous changes, grab a slider (or your favorite keyboard equivalents), and scrub back and forth through a condensed replay of someone else's actual changes exactly as they were keyed.

I'm not sure this would always be a good thing (a form of surveillance?), but it would certainly be useful when the original author of the code is unavailable.

Version control systems are not storing what actually changed between files, they are storing the smallest set of differences between them.

Many version control systems work that way, but there are other possible strategies. For example, Darcs' patch theory is very much based on recording what actually changed in each revision (and consequently achieves better results in some awkward cases than a purely text-based VCS).

No, actually, Darcs is not; it only looks like it is. Again, unless it's recording the keystrokes and line changes you actually made in the editor, it's still just approximating what happened with a text diff algorithm that doesn't know what you actually did and is figuring it out after the fact.

Besides not having the goal of figuring out what changed, they are often heuristic and give up (i.e., they stop trying to align the original files and just say "removed here, added there").

No, actually, Darcs is not, it only looks like it is.

I think that's a little unfair. Darcs does store what actually changed between the files, at least at the points they were committed, in a qualitatively different way to the flattening effect of cumulative commits in a system like say Git.

Of course, you can defeat even that approach if your edits from one commit to the next are ambiguous, for example if you have two verbatim copies of something next to each other where you had only one before, and this will subsequently lead to ambiguity if you try to merge a change from someone else to the original copy since the merge has no way to determine whether the first or second duplicate (or both) should be modified.

It sounds like you want something that is directly tied into your editor, so it is aware of changes between commits, or perhaps something that has semantic understanding so that instead of recording half a dozen text edits, it records "variable foo was renamed to bar". Tools that worked on that level would be fantastic to work with, but until you've swapped a text file representation of code for some sort of database-backed semantic model and your edits/refactorings can all be expressed in terms of that model, I don't see how any VCS could possibly achieve it.

It's really not qualitatively different, it's just faster for the operations it provides. It does, however, take great pains to try to display the "right thing" to the user. It also does not, in fact, lose information once given it. But that is not qualitatively different from what you could make git or svn or anything else do; it's just a more careful implementation.

As for the rest, I don't want anything. I just don't pretend that textual displays of blame/diff/etc. are actually showing me a correct history, instead of one possible history of textual changes. If it says Bob wrote/changed some code, I don't assume that's really correct (unless, of course, the changelog says "wrote code" :P), I ask Bob.

I will point out that, historically, there were version control systems that were integrated like you describe, even for C++ (IBM VisualAge C++). But people are happy with what things like git/svn/etc. provide, and that's fine by me.

Remember that I worked on a system whose sole goal was to provide a qualitatively better experience than CVS, so I don't have very high standards :P.

All that said, one of the reasons you may like it is because it's basically what everyone actually does.

Please correct me if I'm misunderstanding, but Bram is proposing to do blame without using the diffs, a nice simplification and not something I'm aware of other version control systems doing.

The commits themselves will still, presumably, be generated by diffing. After that, sure, it should be fairly easy to compute exactly what commit contributed each character of a file, and when.

No. The simplest example I can think of is:

  Revision 1: x

  Revision 2: xx
Which x was added in revision 2?
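Even a plain text differ has to commit to an answer here (difflib below is just an illustration, not any VCS's internals): for "x" → "xx" it reports the second x as the insertion, but nothing in the input justifies that choice over the first.

```python
import difflib

# Align revision 1 ("x") with revision 2 ("xx") character by character.
sm = difflib.SequenceMatcher(a="x", b="xx")
print(sm.get_opcodes())
# [('equal', 0, 1, 0, 1), ('insert', 1, 1, 1, 2)]
# i.e. it decides the *second* x is the new one, arbitrarily.
```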

In real life, you get things like replacements of a statement by an if-else block with statements similar to, but not identical to, the original statement in each branch. Are there still parts of the original statement there? If so, in which branch(es)?

Also, suppose I add code in version 6, you remove it in version 14 and someone else resurrects it in version 23. What commit contributed that code? How can you know whether that someone else resurrected the code, rather than write a new copy?

Finally, what is 'contribute'? Does a commit that 'only' moves lines around in online help contribute to a file? What does it contribute? How do you show that contribution in a diff? What if the move also necessitated some minor changes like adding punctuation?

No, you can't do blame without the diffs. Bram is proposing not attempting to figure out the actual lineage of a line, but instead, when there are multiple possible answers, pick one.

I like the Quora/cats comment.

I also liked the response: the claim that StackOverflow would be a better forum than Quora for this kind of question; StackOverflow closed the question as "non-constructive", as they specialize in "concrete questions with factual answers"... honestly, Quora seems like the perfect place for this kind of thing.

Off topic, but I'm glad to see people use Facebook as a sort of blogging platform. If written in Notes, I'm quite sure even photos could be embedded.

Your RSS reader would simply merge with your news feed.

But it's not interoperable elsewhere. This would work for others on Facebook, and wouldn't work if one were to leave Facebook. In other words, if everyone decided to use Facebook as their blogging platform, then their entire audience would be locked in to Facebook.

Their unstated goal is to become The Internet itself. They want to curate content, only show people what they want, and have you never leave their site.

Do not trust.

I'll be happy instead to see content posted in places without barriers to viewing. I have no Facebook account, and I won't create one and drink that Kool-Aid just to see what might be something very compelling. Post it elsewhere and see how much more inclusive and open it feels.

Even if Facebook somehow managed to deploy WordPress- or wiki-quality content systems, I still won't go there.

My <thing I haven't used in three years> would merge with my <thing I've never had, seen, or would want to use>?

Why? How is that better than using WordPress and RSS?

If you use Facebook as your blogging platform, the dollars generated by showing ads to your audience go to Zuckerberg.

If you host your own WordPress install, the ad dollars go to you.

If you're Mark Zuckerberg, bloggers blogging on Facebook is better than bloggers blogging elsewhere. That's the circumstance under which blogging on Facebook is better.

While I tend to agree with this exact comment (since it's Facebook specifically), I don't agree in spirit, because administering your own WordPress install takes time. That time might be valuable for some.

I'd certainly run my own WordPress, but not everybody wants to.

You're right, it's really not.

I suppose I just connect more with a Facebook profile than with a traditional blog.
