RCS is patch-based: the most recent version is kept in clear text, the previous version is stored as a reverse patch, and so on back to the first version. So getting the most recent version could be fast (it isn't), but the farther back you go in history, the more time it takes. Branches are even worse: you have to patch backwards to the branch point and then forwards to the tip of the branch.
SCCS is a "weave". The time to get the tip is the same as the time to get the first version, or any version in between. The file format looks like this:
^AI 1
this is the first line in the first version.
^AE 1
That's "insert in version 1", the data, then "end of insert for version 1".
Now let's say you added another line in version 2:
^AI 1
this is the first line in the first version.
^AE 1
^AI 2
this is the line that was added in the second version
^AE 2
So how do you get a particular version? You build up the set of deltas that are included in that version: in version 1 that's just {1}, in version 2 it's {1, 2}. Then you sweep through the file and print any line that belongs to a delta in your set. So to get version 1, you print the first line, reach the ^AI 2, see that 2 isn't in your set, and skip until you hit the ^AE 2.
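The sweep described above can be sketched in a few lines of Python. This is a toy model, not the real file format: actual SCCS also has ^AD (delete) blocks, nesting, and a delta table at the front of the file.

```python
# Toy model of the SCCS weave sweep described above. Real SCCS also has
# ^AD (delete) blocks and a delta table; this handles insert-only weaves.
CTRL = "\x01"  # the ^A control character

def checkout(weave_lines, version_set):
    """One pass over the weave: keep lines whose insert block is in the set."""
    out = []
    stack = []  # versions of the currently open ^AI blocks
    for line in weave_lines:
        if line.startswith(CTRL + "I "):
            stack.append(int(line[3:]))
        elif line.startswith(CTRL + "E "):
            stack.pop()
        elif stack and stack[-1] in version_set:
            out.append(line)
    return out

weave = [
    "\x01I 1",
    "this is the first line in the first version.",
    "\x01E 1",
    "\x01I 2",
    "this is the line that was added in the second version",
    "\x01E 2",
]

# Version 1 is the set {1}; version 2 is {1, 2}.
print(checkout(weave, {1}))
print(checkout(weave, {1, 2}))
```

Whatever version you ask for, the whole weave is scanned exactly once, which is why every version costs the same.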
So any version takes the same time. And that time is fast: the largest file in our source base is slib.c, 18K lines, and it checks out in 20 milliseconds.
I had... much too extensive experience both with SCCS weaves and with hacking them way back in the day; I even wrote something which sounds very like your smoosh, only I called it 'fuse'. However, I wrote 'fuse' as a side-effect of something else, 'fission', which split a shorter history out of an SCCS file by wholesale discarding of irrelevant, er, strands and of the history relating to them. I did this because the weave is utterly terrible as soon as you start recording anything which isn't plain text or which has many changes in each version, and we were recording multimegabyte binary files in it by uuencoding them first (yes, I know, the decision was made way above my pay grade by people who had no idea how terrible an idea it was).
Where RCS or indeed git would have handled this reasonably well (indeed the xdelta used for git packfiles would have eaten it for lunch with no trouble), in SCCS, or anything weave-based, it was an utter disaster. Every checkin doubled the number of weaves in the file, an exponential growth without end which soon led to multigigabyte files which xdelta could have represented as megabytes at most. Every one-byte addition or removal doubled up everything from that point on.
And here's where the terribleness of the 'every version takes the same time' decision becomes clear. In a version control system, you want the history of later versions (or of tips of branches) overwhelmingly often: anything that optimizes access time for things elsewhere in the history at the expense of this is the wrong decision.
When I left, years before someone more courageous than me transitioned the whole appalling mess to git, our largest file was 14GiB and took more than half an hour to check out.
The SCCS weave is terrible. (It's exactly as good a format as you'd expect for the time, since it is essentially an ed script with different characters. It was a sensible decision for back then, but we really should put the bloody thing out of its misery, and ours.)
Yeah. I suspect the answer is 'store all binary data in BAM', which then uses some different encoding for the binary stuff -- but that then makes my gittish soul wonder why not just use that encoding for everything. (It works for git packfiles... though 'git gc' on large repos is a total memory and CPU hog, one presumes that whatever delta encoding BAM uses is not.)
We support the uuencode horror for compat (and for smaller binaries that don't change), but the answer for binaries is BAM; there is no data in the weave for BAM files.
I don't agree that the weave is horrible, it's fantastic for text. Try git blame on a file in a repo with a lot of history then try the same thing in BK. Orders and orders of magnitude faster.
And go understand smerge.c and the weave lightbulb will come on.
Yeah, that's the problem; it's optimizing for the wrong thing. It speeds up blame at the expense of absolutely every other operation you ever need to carry out; the only thing which avoids reading (or, for checkins, writing) the whole file is a simple log. Blame is a relatively rare operation: its needs should not dominate the representation.
The fact that the largest file you mention is frankly tiny shows why your performance was good: we had ~50,000-line text files (yeah, I know, damn copy-and-paste coders) with a thousand-odd revisions and a resulting SCCS file size exceeding three million lines, and every one of those lines had to be read on every checkout: dozens to hundreds of megabytes. And of course the cache would hardly ever be hot with that much data, so it all had to come off the disk and/or across NFS, taking tens of seconds or more in many cases. RCS could have avoided reading all but 50,000 of those lines in the common case of checking out recent changes. (git would have reduced read volume even more, because although it is deltified, the chains are of finite length, unlike the weave, and all the data is compressed.)
I'd do that if I was still working there. I can probably still get hold of a horror case but it'll take negotiation :)
(And yes, optimizing merge matters too, indeed it was a huge part of git's raison d'etre -- but, again, one usually merges with the stuff at the tip of tree: merging against something you did five years ago is rare, even if it's at a branch tip, and even rarer otherwise. Having to rewrite all the unmodified ancient stuff in the weave merely because of a merge at the tip seems wrong.)
(Now I'm tempted to go and import the Linux kernel or all of the GCC SVN repo into SCCS just to see how big the largest weave is. I have clearly gone insane from the summer heat. Stop me before I ci again!)
Our busiest file is 400K checked out and about 1MB for the history file lz4 compressed. Uncompressed is 2.2M and the weave is 1.7M of that.
Doesn't seem bad to me. The weave is big for binaries, we imported 20 years of Solaris stuff once and the history was 1.1x the size of the checked out files.
Interesting.
"^AI Spec" where Spec feeds into a predicate f(Spec, Version) to control printing a particular Version?
Looks like you could drop the ^AE lines.
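The "^AI Spec" generalization suggested above might look something like this sketch (hypothetical; plain SCCS just uses the inserting delta's number and set membership). The ^AE lines are kept here because blocks can nest, so an explicit end marker keeps the scan trivial.

```python
# Hypothetical sketch of the "^AI Spec" idea above: each block header
# carries an opaque spec, and a predicate f(spec, version) decides
# whether the block's lines appear in a given version.
CTRL = "\x01"

def extract(weave_lines, version, pred):
    stack, out = [], []
    for line in weave_lines:
        if line.startswith(CTRL + "I "):
            stack.append(line[3:])      # keep the spec as an opaque string
        elif line.startswith(CTRL + "E"):
            stack.pop()                 # explicit block end; blocks can nest
        elif stack and pred(stack[-1], version):
            out.append(line)
    return out

# Plain SCCS semantics fall out as a special case: the spec is the
# inserting delta's number and the predicate is membership in the
# version's delta set.
delta_sets = {1: {1}, 2: {1, 2}}
sccs = lambda spec, v: int(spec) in delta_sets[v]

weave = ["\x01I 1", "first line", "\x01E 1", "\x01I 2", "added later", "\x01E 2"]
print(extract(weave, 1, sccs))
print(extract(weave, 2, sccs))
```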
I'm probably missing something. Both work on one file at a time and have some form of changeset. One adds from back to front (kind of), the other from front to back (RCS). I'm not sure where the reduction in work comes from.
bzr got more than that from BK; it got one of my favorite things, per-file checkin comments. I liken those to regression tests: when you start out you don't really value them, but over time the value builds up. The fact that Git doesn't have them bugs me to no end. bzr was smart enough to copy that feature, and that's why MySQL chose bzr when they left BK.
The thing bzr didn't care about, sadly, is performance. An engineer at Intel once said to me, firmly, "Performance is a feature".
Git's attitude, AFAIK, is that if you want per-file comments, make each file its own checkin. There are pros and cons to this.
Performance as a feature, OTOH, is one of Linus's three tenets of VCS. To quote him, "If you aren't distributed, you're not worth using. If you're not fast, you're not worth using. And if you can't guarantee that the bits I get out are the exact same bits I put in, you're not worth using."
If I remember the history correctly, per-file commit messages were actually a feature that was quickly hacked in to get MySQL on board. bzr did not have them before those MySQL talks, and I don't think the feature was very popular afterwards.
Performance indeed killed bzr. Git was good enough and much faster, so people just got used to its weirdness.
> Git was good enough and much faster, so people just got used to its weirdness.
And boy is git weird! In Mercurial, I can mess with a file all day long after scheduling it for a commit, but one can forget about that in git: marking a file for addition actually snapshots it at addition time, and I have read that this is actually considered a feature. It's like I already committed the file, except that I didn't. This is the #1 reason why I haven't migrated from Mercurial to git yet, and now with BitKeeper free and open source, chances are good I never will have to. W00t!!!
I just do not get it... what exactly does snapshotting a file before a commit buy me?
It's probably the same idea as the one behind committing once in Mercurial and then using commit --amend repeatedly as you refine the changes. Git's method sounds like it avoids a pitfall of that approach by holding your latest changeset in a special area rather than dumping it into a commit, so that you can't accidentally push it.
I often amend my latest commit as a way to build a set of changes without losing my latest functional change.
I always do a hg diff before I commit. If in spite of that I still screw up, I do a hg rollback, and if I already pushed, I either roll back on all the nodes, or I open a bug, and simply commit a bug fix with a reference to the bug in the bug tracking system. I've been using Mercurial since 2007 and I've yet to use --amend.
> In Mercurial, I can mess with the file all day long after scheduling it for a commit
OTOH, I find that behavior weird as I regularly add files to the index as I work. If a test breaks and I fix it, I can review the changes via git diff (compares the index to the working copy) and then the changes in total via git diff HEAD (compares the HEAD commit to the working copy).
10 or even 20% better performance is not a feature. But when tools or features get a few times faster or more, their usage model changes -- which means they become different features.
In short, RCS maintains a clean copy of the head revision, and a set of reverse patches to be applied to recreate older revisions. SCCS maintains a sequence of blocks of lines that were added or deleted at the same time, and any revision can be extracted in the same amount of time by scanning the blocks and retaining those that are pertinent.
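The two access patterns can be contrasted in a toy sketch (not the real on-disk formats: RCS stores ed-style scripts and SCCS stores ^A control lines, but the cost model is the same). In the RCS model, checkout cost grows with distance from the tip; in the weave model, any version is one scan of the whole file.

```python
# Toy contrast of the two models summarized above: RCS keeps a clear-text
# head plus reverse deltas, so cost grows with distance from the tip.

def rcs_checkout(head, reverse_deltas, steps_back):
    """Apply reverse deltas one at a time, walking back from the head."""
    lines = list(head)
    for delta in reverse_deltas[:steps_back]:
        lines = delta(lines)   # each delta maps a version to its parent
    return lines

head = ["a", "b", "c"]                  # version 3, kept in clear text
reverse_deltas = [
    lambda l: l[:2],                    # v3 -> v2: "c" was added in v3
    lambda l: l[:1],                    # v2 -> v1: "b" was added in v2
]

print(rcs_checkout(head, reverse_deltas, 0))   # tip: no patching at all
print(rcs_checkout(head, reverse_deltas, 2))   # v1: every delta applied
```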
Really old school revision control systems, like CDC's MODIFY and Cray's clone UPDATE, were kind of like SCCS. Each line (actually card image!) was tagged with the ids of the mods that created and (if no longer active) deleted it.