History of version control - 10 astonishments (flourish.org)
228 points by frabcus on Dec 17, 2011 | 73 comments

I would say it's missing 7.5, 1996 public anonymous cvs. Before that, you were either one of the developers with a login or you waited until a source tarball was released. Which makes it really hard for outsiders to contribute because they're aiming for a moving target they can't see.

Yes, absolutely. CVS, for all its faults, changed the way we think about source control. It became a public record, not just a tool. And what's funny is I don't think any of us realized it at the time; we were all too hung up over how weird lockless modification was and learning to do merge failure resolution.

That was one of the major hallmarks, I seem to recall, of OpenBSD - though we all remember it for security right now.

Yes, anoncvs was developed specifically to make openbsd open.

Some version of #2 might actually not be a horrible idea; companies seem fairly adept at losing old source code, and physical archives run by competent librarians are a little bit more durable and organized. Reminds me of a recent story from engineering (http://wrttn.in/04af1a), but there are plenty in tech as well, where e.g. someone resorts to emailing a former contractor asking if they have a copy of the source code they wrote on their contract job three years ago, because the company has somehow lost it.

I can relate to your link, since I now have a number of hard-drives with a ton of stuff on them that has been copied back and forth and mixed up so many times I am not quite sure what is what and which directories are complete (I have a small suspicion that none of them are). I can't imagine how a large company can possibly figure this out.

But your solution isn't going to help much. In twenty years, do we even have a computer that is capable of reading the old floppy disks? Most computers these days don't come with a floppy drive and the new mac mini doesn't even have a CD drive. Sure USB may be around, but twenty years ago you would have said the same thing about the 3.5inch floppy.

Really, if you want to store source code like that, you would have to print it on physical paper and store it in a massive archive. And how would you handle changes?

Do you want to print everything each time you do a svn commit? Or just the diff (yeah, that is going to be fun to type in)?

A central server, properly organized and upgraded, would probably be the best, but even so it is never going to be very good. In a world where the price of data is very close to nothing, good metadata seems increasingly expensive.

If you have a dedicated archive with dedicated librarians managing it, they can be in charge of migrating the archived data forward whenever a particular storage technology threatens to become obsolete.

And they can illuminate the cover titles in calligraphy with little illustrations of cherubs and such.


Love that story! Thanks for linking to it.

Just as hard with software, whatever version control system is used...

Who is going to remember the culture of old-but-still-running Ruby on Rails apps in 30 years time, when even Node.js isn't fashionable any more ;)

It misses out BitKeeper, which inspired both Git and Mercurial, which was launched 5 years earlier.

It gives very short shrift to commercial source control systems generally. Two (VSS, and SCCS --- the latter part of the original AT&T Unix distributions, on the same closed-source terms) get mentioned in passing, but neither gets treated as a milestone.

That would include BitKeeper --- copies were available for a few years at no cost to Linux kernel developers, but only on increasingly restrictive terms, which McVoy ultimately wound up revoking altogether. IIRC, Linus released the first embryonic version of git within weeks after McVoy withdrew the free-as-in-beer version of BitKeeper.

In terms of historical interest, it's also worth noting why McVoy restricted and eventually withdrew the BitKeeper license: Andrew Tridgell (of Samba and rsync) had begun reverse-engineering the protocol with the eventual intent of implementing a free client.

The resulting debacle was fairly ridiculous: Linus chastising Andrew Tridgell, and continued flamewars about using a closed-source product.

What I find interesting here, though, is just how much hot water McVoy landed in. He gave away free licenses to Linux developers, then when someone in that community started reverse engineering his product with the intent to replace it, he revoked the free license, leading Linus to develop a replacement anyway -- one that has largely consumed the vast majority of BitKeeper's target market.

Tridge's reverse engineering was why McVoy withdrew the license, but he'd started to restrict it well before that. Initial releases were source-available, and had a "fail-safe" that would turn the code fully open source if McVoy's company ceased to function. Source access and the failsafe were withdrawn over time, and various restrictive clauses were added. (Most notably, around 2002, McVoy added a "non-compete" clause which purported to restrict any user of gratis BK from working on a competing SCM for a full year after they last touched BK. I'm not sure that was ever tested in court.) Here's a brief description of the history:


Andrew Tridgell recreated the "reverse engineering" you mentioned in front of a live audience at linux.conf.au not long afterward; see https://lwn.net/Articles/132938/ . Summary: he saw a port number in the standard bitkeeper URLs, tried telnetting to it, tried typing "help" which listed the available commands, and tried typing the "clone" command which spit SCCS files back at him. He walked through the process and literally had the audience shouting the appropriate commands at him the whole way through, demonstrating the obviousness of the process.

Well, if I recall correctly, his reverse "engineering" consisted of figuring out that the protocol was a plain-text, English-based protocol (i.e., it would be like reverse engineering FTP by looking at the TCP stream).

Misses the absolutely massive Clearcase. In 1999 (and earlier, but I ran into it in 1999 at Loudcloud), if you wanted to support multiple branches and allow merging code into them, it was the only tool that made it easy. It had a great (Windows) client-side environment that gave everyone a "view" into the source, but _man_ was the backend ugly.

ClearCase is certainly popular, but I don't think it ever had a feature that was a big milestone in version control history. Client side software with a view of the source is already covered in the author's point #4, and ClearCase doesn't do this better than the others (quite the contrary, in my experience).

Clearcase had two new additions: Branching/merging made easy(ier?) and the source code under version control was seen as a file system, a drive letter on your windows system.

I'd agree with that, but the article seems focused on a linear progression, where clearcase is more of a fork that nobody else followed.

It also completely ignores earlier distributed source control projects like GNU Arch. 2005 was when there was an open source distributed VCS that was fast and pleasant to use, but implementations of the idea are older than that.

Darcs merits a nod, as well.

Misses also GOOG's dearly Perforce.

This article isn't meant to be a comprehensive list of all SCM systems or even all of the important ones. It's just a list of all of the big technological advancements. Perforce is a good system but it was never revolutionary.

There is nothing particularly astonishing about perforce. What makes you think it should have been included?

If you ignore all the distributed "stuff", and the workflow enhancements it permits, and assume everybody is always connected to the server and attached via a LAN, you can just concentrate on doing a reasonably good job of handling very large quantities of data, including very large binary files. (Apologies for not trying to reproduce the breathless style of headline.) As is common, the article presupposes that decentralization is unambiguously progress, but that isn't true in all respects.

People often complain about the idea of using version control for large binary files, as if it is unreasonable to want such a thing, and that as a point of principle version control systems should contain only text files, and the fact that many version control systems support this poorly is proof that you don't want it anyway. But there are actually people who create, with their own hands, large binary files, often of the completely-unmergeable variety, and they deserve version control just as much as the programmers do.

(And then once you have a system that works well for them, you can then use it to solve all manner of problems that might previously have involved storing files in public folders, mailing them round, or maybe just waiting for them to compile again. No need for any of that crap any more - just check the files in, they're there forever, and you can get them back quickly.)

> As is common, the article presupposes that decentralization is unambiguously progress, but that isn't true in all respects.

It is, in the sense that a decentralized VCS is, essentially, a superset of a centralized one.

Blobs are certainly still an issue, though orthogonal to distribution (I don't think you intended to imply it was related, but it could be read as if you did).

Well, I don't mean to imply that binary files are inherently impossible to handle using a decentralized system. In fact, I have some PDFs and PNGs in my git repository, and git has managed not to make a mess of them. But I still think binary files are difficult for distributed systems to support well.

Distributed systems rely on allowing people to (in effect) create multiple versions of the same file, and then merge them all together later. But it's very rare that binary files are mergeable! And if the file can't be merged, the distributed approach won't work. People will step on one another's changes by accident, and people will have to redo work.

The usual solution is simply not to allow multiple versions to exist: enforce some kind of locking system, so that each editor has to commit their changes before the next one can have a go. But now you need some centralized place to store the locking information...
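
To make the locking idea concrete, here is a minimal sketch of a central lock table for unmergeable files. `LockRegistry`, the method names, and the file path are all hypothetical, invented for illustration; real systems that offer exclusive checkouts implement this in their own ways.

```python
class LockRegistry:
    """Hypothetical central lock table: at most one editor per path at a time."""

    def __init__(self) -> None:
        # Maps a file path to the user currently holding its lock.
        self._owners: dict[str, str] = {}

    def acquire(self, path: str, user: str) -> bool:
        """Grant the lock if the path is free or already held by this user."""
        owner = self._owners.setdefault(path, user)
        return owner == user

    def release(self, path: str, user: str) -> bool:
        """Release the lock, but only if this user is the one holding it."""
        if self._owners.get(path) == user:
            del self._owners[path]
            return True
        return False


registry = LockRegistry()
alice_got_it = registry.acquire("art/logo.psd", "alice")  # free, so granted
bob_got_it = registry.acquire("art/logo.psd", "bob")      # held by alice, so refused
```

The point of the sketch is that `_owners` has to live in exactly one place that everyone agrees on, which is the centralization the comment above is talking about.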

I never had a problem using cvs for what I considered large binaries. Certainly we kept using cvs long after mostly switching to bk because cvs worked better for binaries. Perforce never seemed like a big deal, just cvs with a little more.

Due to CVS's (or RCS's?) text-based file format and conversion of line endings when checking out a repository, migrating binary files across operating systems can cause mangling of bytes whose value equals that of CR and LF.
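
A small sketch of why that hurts, assuming a client that treats every file as text. The two functions below are hypothetical simulations of a text-mode conversion, not CVS's actual code, and the byte payload is made up.

```python
def checkout_as_text(data: bytes) -> bytes:
    """Simulate a text-mode checkout on a CRLF platform: every LF becomes CRLF."""
    return data.replace(b"\n", b"\r\n")


def checkin_as_text(data: bytes) -> bytes:
    """Simulate a text-mode check-in on a CRLF platform: every CRLF becomes LF."""
    return data.replace(b"\r\n", b"\n")


# A made-up binary payload that happens to contain 0x0A and the pair 0x0D 0x0A.
original = bytes([0x89, 0x50, 0x0A, 0x00, 0x0D, 0x0A, 0xFF])

# Committed untouched on Unix, then checked out as text on a CRLF platform:
# two extra 0x0D bytes appear in the working copy.
grown = checkout_as_text(original)

# Committed as text from a CRLF platform: the 0x0D byte is silently dropped
# from what the repository stores.
shrunk = checkin_as_text(original)
```

Either way the bytes no longer match the original file, which is harmless for source code and fatal for an image or an archive. Marking the file as binary so that no conversion is applied is the usual way out.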

Google's Perforce?

I read the "GOOG's dearly" part as "dear to Google" not "developed by Google". Google did use Perforce in the past and they still do, probably.

I think that the GP used the phrase correctly, but not being a native English speaker myself, I'm not entirely sure.

I believe perforce handles large binaries as well.

Monotone gave most inspiration according to the authors of git and hg.

But the author addresses the point anyway: > I’m not recording the first time anyone made the astonishing thing, but the first time it was productised and became popular.

Torvalds was expressly trying to replace bitkeeper when he wrote git. And I'm not sure anything he said about other systems might be considered "inspiration".

But that said, git really isn't very similar to bitkeeper except insofar as it enables distributed development. Both model development as a forest of independent developer trees which communicate with each other through merges. But bitkeeper is still a traditional centralized server keeping a bunch of delta'd files.

According to Wikipedia[1], he liked some of its features, and seemed to float it as a model for a replacement.

[1] http://en.wikipedia.org/wiki/Monotone_(software)#Monotone_as...

bitkeeper only has a central server to the same extent that git does. Somebody on the team picks a repo and says "this one is for serious" and that's it.

The decision bitkeeper made to keep files in SCCS format was of course not revolutionary, but tells you quite a bit about the target market (people who had makefiles that relied on implicit commands like get just working). They went to extra effort to make it look just like a bunch of delta'd files.

It also misses NSElite (later Teamware), which, in Larry McVoy's words, was BitKeeper's grandparent.

There are actually a number of historians looking at version/source control (although there still remains a dearth of study, given the importance of the issue). Michael Mahoney, Michael Cusumano, and N.L. Ensmenger are three of the more important. For a contextualization of the history, see my short presentation (at UCLA): http://www.iqdupont.com/networked-modes-of-production/

Scholarship of this kind is important as it gathers secondary references for future works of synthesis as well as preventing the repetition of mistakes!

Is anyone doing a decent work of synthesis for the history of computing in general? Something like Judt/Postwar or TARUSKIN/History of Western Music?

Ugh, SourceSafe. I worked for a company that used it. They had terrible intermittent file corruption issues. Long story short, we tracked down the root cause -- an Ethernet cable wrapped too tightly around a power supply brick, which caused network errors over SMB which led to file corruption. Ugh.

> we tracked down the root cause -- an Ethernet cable wrapped too tightly around a power supply brick

I just became a little lightheaded, and my vision has gone all blurry and grey.

I hope you're claiming some sort of disability compensation from that company over the long term harm this must have caused you.

Back in the late 1990s, my team's SourceSafe repository would get corrupted weekly. Like your experience, using SMB over a slow network seemed to have a 50% chance of corrupting the repo. SourceSafe was worse than no source control...

Hehe. SourceSafe is still the standard at the state govt I work for... high tech for, say, mid 1990s, which makes it downright fancy for us.

I used SourceSafe for the better part of 10 years. The underpinning technology very much resembled RCS.

But all-in-all, it was quite serviceable for a 75,000 line C++ project. Just don't try to do branches.

That's a little bit like saying "All-in-all, this car is quite serviceable. Just don't try to use third gear."

Well, that made me chuckle, and I can't disagree. But to the other comment's post, it never lost anything for us.

Thank goodness we didn't need to do any branching.

So perhaps we can think of RCS as first gear, VS as second gear, CVS/SVN as third gear, and git as fourth gear?

Is there any vcs that would be better than any other when it comes to a serious EMI issue? And, who wraps an ethernet cable around a switch mode power supply brick?

Yes, almost any other VCS would fare better. VSS was set up so that clients could write files directly into the repository -- if the client failed for whatever reason, the repository file could get corrupted.
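
As an illustration of the difference, here is one common way to keep a half-finished write from damaging an existing file: write to a temporary file, then rename it over the target. This is a generic sketch, not a description of how any particular VCS stores data; `atomic_write` is a hypothetical helper, and the rename guarantee assumed here is the one local filesystems usually give, which a network share may not.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        # A failure before the swap leaves the original file untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the client dies partway through, the worst case is a stray temporary file rather than a truncated repository file.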

The bad memories are coming back. IIRC you would mount a network drive containing the VSS files. One company I worked for had a virus which infected the repository and I don't think it was ever completely cleaned out.

And salvaging a corrupted repository was a pain because SourceSafe stored your files in a munged format spread across dozens of files named aaaaaaaa, aaaaaaab, aaaaaaac, and so on. :(

It is interesting to me that Subversion barely gets a mention. There should be a 7.5 which is along the lines of:

cvs was great and all, but we couldn't version our directories, branching and merging was a mess, the wire protocol was hard to use, it had a ton of security holes and the storage format took up too much space. So, Subversion was created as a way to do a better cvs, without thinking about the larger intrinsic issues with the current state of version control. Thus, missing the whole 'distributed' boat and letting Linus eat our lunch.

Note: I worked at CollabNet during this time and watched a lot of the discussions around Subversion. I have great respect for the Subversion developers. Karl is an awesome and brilliant guy. It was a bubble, technology wasn't anywhere near where it is today, and I think we were all misguided at that time. We all made a lot of mistakes.

The article is about astonishment. Everything SVN did was fairly obvious.

Ha! Definitely can't argue that point. ;-)

With Subversion branching and merging was still a mess, at least everywhere I ever saw it used.

Oh, it is a complete failure. I've switched to Git entirely as a result of it and will never use Subversion again if I can help it.

I wrote a long'ish blog post detailing why it is a failure:


It makes me wonder, what is next? What new astonishing thing will happen in version control?

I think what's needed is an intelligent (as in AI) merge mechanism. Right now, if two people are adding two different features to a set of files, then merging those changes is error-prone and requires a lot of manual work.

If this ever gets perfected and automated, it will be a huge milestone.

This should be much simpler once we stop using those silly text files and start storing everything as a proper representation of AST. Then again, at that point we can get rid of the silly text-diff-based systems and just store everything as versioned trees.

I actually wrote a very simple system like this for a hackathon several months ago. The idea is that we would take some basic Scheme code (boy did we aim high :), parse it and commit the result. We would then diff the trees and keep track of the changes that way. Finally we had a cute web front-end that pretty printed the code from the AST and could show the diffs visually.

We got the basics working, including simple diffs. One goal was to link the same variables between two versions; we did not manage to make that work, but had a very hacky approach that looked like it worked.

Doing any sort of merging with this data is nontrivial. We were planning to implement it, but unfortunately ran out of time. Still, we did have a cute demo of some commits and some diffs in the end--it actually worked a little, which is much more than I expected starting out.

However, despite not implementing merging, we did throw in some nice features. Particularly, we were able to identify commits that did not change the function of the code (whitespace and comment changes only) and mark them. This was very easy but yet still useful, and a good indicator of the sorts of things one could do with a system like that.
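
The "did this commit change the code at all?" check is easy to sketch. The hackathon project above parsed Scheme; the version below swaps in Python and its standard `ast` module instead, and `same_program` plus the sample sources are made up for illustration.

```python
import ast


def same_program(old_source: str, new_source: str) -> bool:
    """Return True if both sources parse to the same syntax tree.

    ast.dump() leaves out line and column positions by default, and the parser
    discards comments, so whitespace-only and comment-only edits compare equal.
    """
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


old = "def area(w, h):\n    return w * h\n"
# Same code with trailing whitespace, an added comment, and tighter spacing.
cosmetic = "def area(w, h):   \n    # width times height\n    return w*h\n"
# A real behaviour change: multiplication became addition.
changed = "def area(w, h):\n    return w + h\n"

cosmetic_only = same_program(old, cosmetic)
really_changed = not same_program(old, changed)
```

Note that docstrings are part of the tree in Python, so editing one counts as a change under this check. A real tool would have to decide whether that is what it wants.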

After the hackathon, one of my friends found some papers about a system just like ours. I don't remember where they were from, but if you're interested you could look for them. (I think the phrase "semantic version control" is good for Googling; that's what we called our project.)

Overall I think that it's a neat domain but in hindsight maybe it was a little too much for 18 hours of coding :) We did have fun, and it was cool, so I have no regrets.

This sounds like a bad idea. Code is text. If ast helps merging diffs, why not use it for analysis in case of conflict and keep code as text?

I think it's most accurate to say that code is a textual representation of an AST. Saying that it's just text is just like saying it's just a bunch of numbers--both technically true but missing the bigger picture.

One potential reason not to store code as text is that there are many equivalent programs that differ only in inconsequential text. A perfect example is trailing whitespace.

There are also some benefits of storing code as an AST. For one, it would make it trivial to identify commits that did not change the actual code--things like updated comments. This would help you filter out commits when looking for bugs. Another benefit would be better organized historical data: in a perfect system, you would be able to look at the progress of a function even if it got renamed part of the way through.

But then you end up with a version control system that is not generic, but dependent on a particular language. The story of Smalltalk suggests that the added value might not be worth the coupling and complexity it requires.

You should be able to write a generic version control system like this where you can just plug the appropriate parser in and it would work for that language. For backup, you could have it still keep some files as text.

Because you can always convert from ast to text, but not always from text to ast. Also it's easier and faster to convert ast->text, and you can do it when needed only. Additionally you'd never commit a syntax error. Why is it a bad idea?

Now it's time to bring the idea beyond code

Exactly! Try versioning data with ChronicDB:


I don't know what the date would be, but I think somewhere before "you can keep lots of versions in one file" should go "you can keep lots of versions of one file" -- a versioning file system. Not sure when this was introduced; Wikipedia thinks it may have been in ITS.

I used to work on a large Lisp system where our entire source control system was provided by Emacs versions and locking on a central NFS server, with some explicit branching support in the build code, and with version freezing done by copying directories. I can hear you gagging, dear reader, but actually it didn't work that badly, except that it didn't handle distributed development.

2. Humans can manually keep track of versions of code! (1960s)

As everything, to begin with there was no software.

    “At my first job, we had a Source Control department. When you had your code ready to go, you took your floppy disks to the nice ladies in Source Control, they would take your disks, duly update the library, and build the customer-ready product from the officially reposed source.” (Miles Duke)

Balderdash! Floppies were first available in 1971. They had decks in the sixties. They had paper tape. But they didn't have floppies! So, using the terminology of decks, the author scores a validity check!

Yes, well spotted - I had trouble dating that comment, as I assumed the reference to changing to RCS just after it was about 1972.

You're right that it talks about floppies, so it can't have been.

Enjoying the few more comments the article has generated with memories of those earlier days. Would love to see the earlier astonishments written up more precisely.

It's easy to forget that we are living in a golden age of version control systems. The market is rife with many fairly decent commercial systems and some of the best, state-of-the-art systems are completely free.

Sorry to not share your angelism. git seems worshipped here. I never used it, but I used cvs and mercurial a lot, and we are very far from what real versioning should be: completely transparent.

cd dir should propose to update it.

save file should commit it and push it in tmp branch.

I do wonder about transparent persistence & versioning - bring on the unlimited undo, independent of editor processes. This would require a better interface for navigating the history than undo/redo or git log. Time Machine is an exploration in this direction.

"Save" can tag a state as interesting.

But please don't conflate "cd" and "update" - I rarely cd (emacs), but I really don't want to grab partial changes from other branches just because I'm working in a directory. Notification that a file has been changed in other commits would be fine, but "a directory" is a poor heuristic for unit-of-change.

I think you might be letting perfect be the enemy of good. Certainly there's a lot of improvement remaining for version control. Ideally everything (files, database records, etc.) should be versioned automatically, but that shouldn't stop us from appreciating the good state we are in today compared to the dark years when version control for anything was difficult or impossible.

Well, I have used three version control systems: cvs, svn, mercurial. While I agree that the latest is better, I feel it is a bit of an exaggeration to say that before git/mercurial, versioning was difficult or impossible. Many people say git and github are the best inventions of recent years, but I don't see that they have really changed the industry compared to good ol' cvs.

No. That's an online backup of your system. A version control should clearly mark versions you'd ever like to go back to or merge; "save" is not enough for that (many IDEs save every time you compile, and most of the time that's definitely not a version I would like to refer to in the future)

Too much noise/versions is ALSO bad. I'm quite happy with having to tell git every hour or so "this is a version I'd like to go back to"; and continuous zfs snapshots when I want to undo an unplanned delete of the last 20 minutes of work and immediate IDE crash (otherwise, ctrl-z is just as good).

Right tool for the right job.
