
I thought SCCS had the same problems as RCS. What did it do differently?



RCS is patch based: the most recent version is kept in clear text, the previous version is stored as a reverse patch, and so on back to the first version. So getting the most recent version could be fast (it isn't), but the farther back you go in history the more time it takes. And branches are even worse: you have to patch backwards to the branch point and then forwards to the tip of the branch.
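Roughly, in Python, a toy sketch of that access pattern (made-up patch tuples, not RCS's actual ,v diff format):

  # Toy sketch of RCS-style reverse deltas: the tip is stored whole and each
  # step back in history applies one more reverse patch, so older versions
  # cost more to reconstruct.
  def checkout(tip_lines, reverse_patches, steps_back):
      lines = list(tip_lines)
      for patch in reverse_patches[:steps_back]:
          # Hunks are assumed ordered bottom-up so earlier positions stay valid.
          for op, pos, payload in patch:
              if op == "del":              # undo an addition: drop `payload` lines
                  del lines[pos:pos + payload]
              else:                        # "ins": put back lines the newer version removed
                  lines[pos:pos] = payload
      return lines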

SCCS is a "weave". The time to get the tip is the same as the time to get the first version or any version. The file format looks like

  ^AI 1
  this is the first line in the first version.
  ^AE 1
That's "insert in version 1" data "end of insert for version one".

Now let's say you added another line in version 2:

  ^AI 1
  this is the first line in the first version.
  ^AE 1
  ^AI 2
  this is the line that was added in the second version
  ^AE 2
So how do you get a particular version? You build up the set of versions that make up that version: in version 1, that's just "1"; in version 2, it's "1, 2". So if you want to get version 1 you sweep through the file and print anything that's in your set. You print the first line, get to the ^AI 2, check whether 2 is in your set, it isn't, so you skip until you get to the ^AE 2.
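Roughly, in Python (a toy reader for just the ^AI/^AE records shown above; the real SCCS file has more record types):

  # Toy weave reader: one pass, keep a content line only while the enclosing
  # insert blocks are all in the wanted set (e.g. {1} for version 1,
  # {1, 2} for version 2).
  CTRL = "\x01"                              # the ^A control character

  def get_version(weave_lines, wanted):
      out, keep = [], [True]                 # stack of "inside wanted blocks?"
      for line in weave_lines:
          if line.startswith(CTRL + "I "):
              keep.append(keep[-1] and int(line.split()[1]) in wanted)
          elif line.startswith(CTRL + "E "):
              keep.pop()
          elif keep[-1]:
              out.append(line)
      return out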

So any version takes the same time. And that time is fast: the largest file in our source base is slib.c, 18K lines, and it checks out in 20 milliseconds.


I had... much too extensive experience both with SCCS weaves and with hacking them way back in the day; I even wrote something which sounds very like your smoosh, only I called it 'fuse'. However, I wrote 'fuse' as a side-effect of something else, 'fission', which split a shorter history out of an SCCS file by wholesale discarding of irrelevant, er, strands and of the history relating to them. I did this because the weave is utterly terrible as soon as you start recording anything which isn't plain text or which has many changes in each version, and we were recording multimegabyte binary files in it by uuencoding them first (yes, I know, the decision was made way above my pay grade by people who had no idea how terrible an idea it was).

Where RCS or indeed git would have handled this reasonably well (indeed the xdelta used for git packfiles would have eaten it for lunch with no trouble), in SCCS, or anything weave-based, it was an utter disaster. Every checkin doubled the number of weaves in the file, an exponential growth without end which soon led to multigigabyte files which xdelta could have represented as megabytes at most. Every one-byte addition or removal doubled up everything from that point on.

And here's where the terribleness of the 'every version takes the same time' decision becomes clear. In a version control system, you want the history of later versions (or of tips of branches) overwhelmingly often: anything that optimizes access time for things elsewhere in the history at the expense of this is the wrong decision.

When I left, years before someone more courageous than me transitioned the whole appalling mess to git, our largest file was 14GiB and took more than half an hour to check out.

The SCCS weave is terrible. (It's exactly as good a format as you'd expect for the time, since it is essentially an ed script with different characters. It was a sensible decision for back then, but we really should put the bloody thing out of its misery, and ours.)


Huh. Now I wonder how BK resolved this.


Yeah. I suspect the answer is 'store all binary data in BAM', which then uses some different encoding for the binary stuff -- but that then makes my gittish soul wonder why not just use that encoding for everything. (It works for git packfiles... though 'git gc' on large repos is a total memory and CPU hog, one presumes that whatever delta encoding BAM uses is not.)


We support the uuencode horror for compat (and for smaller binaries that don't change), but the answer for binaries is BAM; there is no data in the weave for BAM files.

I don't agree that the weave is horrible; it's fantastic for text. Try git blame on a file in a repo with a lot of history, then try the same thing in BK. Orders and orders of magnitude faster.

And go understand smerge.c and the weave lightbulb will come on.


Yeah, that's the problem; it's optimizing for the wrong thing. It speeds up blame at the expense of absolutely every other operation you ever need to carry out; the only thing which avoids reading (or, for checkins, writing) the whole file is a simple log. Blame is a relatively rare operation: its needs should not dominate the representation.

The fact that the largest file you mention is frankly tiny shows why your performance was good: we had ~50,000 line text files (yeah, I know, damn copy-and-paste coders) with a thousand-odd revisions and a resulting SCCS filesize exceeding three million lines, and every one of those lines had to be read on every checkout: dozens to hundreds of megabytes, and of course the cache would hardly ever be hot where that much data was concerned, so it all had to come off the disk and/or across NFS, taking tens of seconds or more in many cases. RCS could have avoided reading all but 50,000 of them in the common case of checkouts of most recent changes. (git would have reduced read volume even more because although it is deltified the chains are of finite length, unlike the weave, and all the data is compressed.)


Give me a file that was slow and let's see how it is in BitKeeper. I bet you'll be impressed.

50K lines is not even 3x bigger than the file I mentioned. Which we check out in 20 milliseconds.

As for optimizing blame, you are missing the point: it's not blame, it's merge; it's copy by reference rather than copy by value.


I'd do that if I was still working there. I can probably still get hold of a horror case but it'll take negotiation :)

(And yes, optimizing merge matters too, indeed it was a huge part of git's raison d'etre -- but, again, one usually merges with the stuff at the tip of tree: merging against something you did five years ago is rare, even if it's at a branch tip, and even rarer otherwise. Having to rewrite all the unmodified ancient stuff in the weave merely because of a merge at the tip seems wrong.)

(Now I'm tempted to go and import the Linux kernel or all of the GCC SVN repo into SCCS just to see how big the largest weave is. I have clearly gone insane from the summer heat. Stop me before I ci again!)


Our busiest file is 400K checked out and about 1MB for the history file lz4 compressed. Uncompressed is 2.2M and the weave is 1.7M of that.

Doesn't seem bad to me. The weave is big for binaries, we imported 20 years of Solaris stuff once and the history was 1.1x the size of the checked out files.


Presumably if you then delete that first line in the third version, you get something like

  ^AI 1
  this is the first line in the first version.
  ^AE 1
  ^AD 3
  ^AI 2
  this is the line that was added in the second version
  ^AE 2

?


Close. By the way there is a bk _scat command (sccs cat, not poop) that dumps the ascii file format so you can try this and see.

The delete needs to be an envelope around the insert so you get

  ^AD 3
  ^AI 1
  this is the first line in the first version.
  ^AE 1
  ^AE 3
  ^AI 2
  this is the line that was added in the second version
  ^AE 2
That whole weave thing is really cool. The only person outside of BK land who got it was Bram Cohen in Codeville; I think he had a weave.
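Roughly, a toy extension of that single-pass reader in Python, covering just the ^AI/^AD/^AE nesting shown in these examples (not the full SCCS record set): a line is printed when some enclosing insert block is in the wanted set and no enclosing delete block is.

  CTRL = "\x01"                              # the ^A control character

  def get_version(weave_lines, wanted):
      out, stack = [], []                    # enclosing (kind, version) blocks
      for line in weave_lines:
          if line.startswith(CTRL):
              kind, ver = line[1], int(line.split()[1])
              if kind == "E":
                  stack.pop()                # ^AE closes the innermost block
              else:                          # ^AI or ^AD opens one
                  stack.append((kind, ver))
          else:
              inserted = any(k == "I" and v in wanted for k, v in stack)
              deleted = any(k == "D" and v in wanted for k, v in stack)
              if inserted and not deleted:
                  out.append(line)
      return out

For the weave above, wanted = {1, 2, 3} (version 3) skips the first line because its ^AD 3 envelope is in the set, while wanted = {1} still prints it.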


It's sort of surprising then that a delete doesn't just put an end-version on the insert instead:

  ^AI 1..2
  this is the first line in the first version.
  ^AE 1..2
  ^AI 2
  this is the line that was added in the second version
  ^AE 2
This way the reconstruction process wouldn't need to track blocks-within-blocks.
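Roughly, what that would make the reader look like, in Python (hypothetical lo..hi specs and a linear version numbering, not anything SCCS or BK actually writes):

  CTRL = "\x01"                              # the ^A control character

  def get_version(weave_lines, version):
      out, active = [], False
      for line in weave_lines:
          if line.startswith(CTRL + "I "):
              spec = line.split()[1]
              if ".." in spec:               # line lives from lo through hi
                  lo, hi = spec.split("..")
                  active = int(lo) <= version <= int(hi)
              else:                          # open-ended: still present at the tip
                  active = version >= int(spec)
          elif line.startswith(CTRL + "E "):
              active = False
          elif active:
              out.append(line)
      return out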


Interesting. "^AI Spec" where Spec feeds into a predicate f(Spec, Version) to control printing a particular Version? Looks like you could drop the ^AE lines.


Sounds like equivalent representations, no? Limit the scope of the I lines or wrap them in D lines.


Probably missing something. Both work on one file at a time and have some form of changeset. One adds from back to front (kind of), the other from front to back (RCS). Not sure where the reduction in work comes from.


Sun must like the scat name :) I used to use the scat tool for debugging core files.


That actually is pretty neat


Aha, so that's where bzr got it from. :-)


bzr got more than that from BK; it got one of my favorite things, per-file checkin comments. I liken those to regression tests: when you start out you don't really value them, but over time the value builds up. The fact that Git doesn't have them bugs me to no end. bzr was smart enough to copy that feature, and that's why MySQL chose bzr when they left BK.

The thing bzr didn't care about, sadly, is performance. An engineer at Intel once said to me, firmly, "Performance is a feature".


Git's attitude, AFAIK, is that if you want per-file comments, make each file its own checkin. There are pros and cons to this.

Performance as a feature, OTOH, is one of Linus's three tenets of VCS. To quote him, "If you aren't distributed, you're not worth using. If you're not fast, you're not worth using. And if you can't guarantee that the bits I get out are the exact same bits I put in, you're not worth using."


Big fan of `git commit -vp` here. Enables me to separate the commits according to concerns.


I suppose that in Git if you wanted to group a bunch of these commits together you could do so with a merge commit.


If I remember the history correctly, per-file commit messages were actually a feature that was quickly hacked in to get MySQL on board. bzr did not have that before those MySQL talks, and I don't think it was very popular after.

Performance indeed killed bzr. Git was good enough and much faster, so people just got used to its weirdness.


> Git was good enough and much faster, so people just got used to its weirdness.

And boy is git weird! In Mercurial, I can mess with the file all day long after scheduling it for a commit, but one can forget that in git: marking a file for addition actually snapshots a file at addition time, and I have read that that is actually considered a feature. It's like I already committed the file, except that I didn't. This is the #1 reason why I haven't migrated from Mercurial to git yet, and now with Bitkeeper free and open source, chances are good I never will have to move to git. W00t!!!

I just do not get it... what exactly does snapshotting a file before a commit buy me?


It's probably the same idea as the one behind committing once in Mercurial and then using commit --amend repeatedly as you refine the changes. Git's method sounds like it avoids a pitfall in that method by holding your last changeset in a special area rather than dumping it into a commit so that you can't accidentally push it.

I often amend my latest commit as a way to build a set of changes without losing my latest functional change.


I always do a hg diff before I commit. If in spite of that I still screw up, I do a hg rollback, and if I already pushed, I either roll back on all the nodes, or I open a bug, and simply commit a bug fix with a reference to the bug in the bug tracking system. I've been using Mercurial since 2007 and I've yet to use --amend.


> In Mercurial, I can mess with the file all day long after scheduling it for a commit

OTOH, I find that behavior weird as I regularly add files to the index as I work. If a test breaks and I fix it, I can review the changes via git diff (compares the index to the working copy) and then the changes in total via git diff HEAD (compares the HEAD commit to the working copy).


Did you know you can do 'git add -N'? That will actually just schedule the file to be added, but won't snapshot it.


Cool. I've used bzr but never knew about per-file comments.


A 10 or even 20% performance gain is not a feature. But when tools or features get a few times faster or more, their usage model changes, which means they become different features.


In short, RCS maintains a clean copy of the head revision, and a set of reverse patches to be applied to recreate older revisions. SCCS maintains a sequence of blocks of lines that were added or deleted at the same time, and any revision can be extracted in the same amount of time by scanning the blocks and retaining those that are pertinent.

Really old school revision control systems, like CDC's MODIFY and Cray's clone UPDATE, were kind of like SCCS. Each line (actually card image!) was tagged with the ids of the mods that created and (if no longer active) deleted it.


| CDC's MODIFY and Cray's clone UPDATE, were kind of like SCCS

Do you have references? I've heard of these but haven't come across details after much creative searching since they are common words.



Thank you! A peek into (as far as I know) the root node of source control history.


I've heard that too. It comes from card readers somehow.



