1. I learned from Linus that git is a decentralized "content" database that gives identity to different versions of the same content and allows them to be compared and merged. (as opposed to a traditional VCS which is more of a "delta" database)
2. I learned from Damien that CouchDB is a decentralized "document" database that gives identity to different versions of the same document and allows them to be compared and merged.
I've been wondering just what the difference really is.
Git retains its change history and, if I remember correctly, uses SHA-1 hashes to verify that no one has changed that history. CouchDB retains the change history only until someone compacts the database (garbage collection). Its revision history is used only as a means of merging out-of-sync documents.
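To make the "hash verifies the history" point concrete, here is a minimal sketch of a hash chain. It is not Git's real commit format (real commits also hash a tree id, author, date and so on); `commit_id`, `build_history` and `verify_history` are hypothetical names made up for illustration.

```python
import hashlib


def commit_id(parent_id: str, content: str) -> str:
    # Assumption: a "commit" is just its parent's id plus some content.
    # Because the parent id is part of the hashed payload, changing any
    # earlier commit changes every id that comes after it.
    payload = f"parent {parent_id}\n{content}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def build_history(contents):
    # Returns a list of (id, content) pairs, oldest first.
    history = []
    parent = ""
    for content in contents:
        cid = commit_id(parent, content)
        history.append((cid, content))
        parent = cid
    return history


def verify_history(history):
    # Recompute every id from the start; any mismatch means tampering.
    parent = ""
    for cid, content in history:
        if commit_id(parent, content) != cid:
            return False
        parent = cid
    return True
```

Editing the content of an old entry without recomputing all later ids makes `verify_history` return `False`, which is roughly why rewriting published history gets noticed.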
In addition, Git merges individual changes within a document, and when there's a conflict, it resorts to human intervention. CouchDB does not merge inside a document and will not do conflict resolution. Instead, using a short set of rules, it picks one version of the document over another.
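As a rough sketch of what "picks one version over another" can look like: my understanding is that CouchDB prefers non-deleted revisions, then the revision with the longer history, then breaks ties by comparing revision ids. The code below is only an illustration of that kind of deterministic rule, not CouchDB's implementation; `rev_key` and `pick_winner` are hypothetical names, and revisions are assumed to be `("<generation>-<hash>", deleted)` pairs.

```python
def rev_key(rev):
    # rev is (rev_id, deleted), e.g. ("3-9c2f", False).
    rev_id, deleted = rev
    generation, _, digest = rev_id.partition("-")
    # Tuples compare left to right: live beats deleted, then the higher
    # generation wins, then the higher hash string wins.
    return (not deleted, int(generation), digest)


def pick_winner(revisions):
    # Every replica holding the same set of revisions computes the same
    # winner, so no coordination is needed. Losing revisions are not
    # merged in; they are simply not chosen.
    return max(revisions, key=rev_key)[0]
```

The point of a rule like this is determinism rather than correctness: every node agrees on the same winner, and the application is left to reconcile the losers if it cares.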
Of course you could. I guess one of the more important points made, somewhat subtly, by this hack is that you can easily fiddle with the functionality of a simple yet flexible piece of software (git). That's what the Unix philosophy preaches: you start with lots of simple components that are flexible and make no assumptions about their use, and glue them together into bigger pieces.
Large homogeneous systems that only provide one functionality are hard if not impossible to extend to make them do just what you want. Unless you want what they do in the first place, which makes the whole customization thing pointless ;-)
Not really a new idea to use a VCS as a document database, other folks have back-ended Wikis and document management systems with SVN, for example. But a nice story regardless, and another nudge for me to really take a closer look at git.
The first distributed VCS I heard of was darcs. I have not used it either, but from what I've read about the two, git has some real advantages in speed and robustness.
> Not really a new idea to use a VCS as a document database, other folks have back-ended Wikis and document management systems with SVN, for example.
It's not the same thing. Using SVN as a document database is equivalent to using the filesystem as a document database. The only thing extra you get with SVN is a revision history.
Git actually is a content database. Its version control capabilities are built on top of that.
When you do a checkout on a traditional VCS you're telling the VCS to apply patch set x to the file system. When you do a git checkout you're telling git to load content x from the database into the filesystem.
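A small sketch of the "content database" idea: content goes in, and the key you get back is derived from the content itself. The `blob_id` function follows the way Git names blob objects (SHA-1 over a `blob <size>\0` header plus the bytes); the `ContentStore` class is a hypothetical in-memory stand-in, not how Git stores objects on disk.

```python
import hashlib


def blob_id(content: bytes) -> str:
    # Git hashes a short header followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ContentStore:
    """Hypothetical in-memory content-addressed store."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # Identical content always maps to the same id, so storing it
        # twice costs nothing extra.
        oid = blob_id(content)
        self._objects[oid] = content
        return oid

    def get(self, oid: str) -> bytes:
        # A "checkout" in this model is just a lookup by id.
        return self._objects[oid]
```

In this model a checkout never replays patches: it looks up an id and writes out whatever bytes are stored under it.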
Darcs is just different. Also, when this Git buzz started, the Darcs people were wondering why the Git folks didn't just contribute to Darcs, since their ideas are almost the same (except for some things like the merge algorithm)... then they realized that some people just don't program in Haskell :(
Git and Darcs have very different underlying models. Most importantly, Git is content-centric (diffs are generated only as byproducts) while Darcs is patch-centric (any given "version" is just a sum of all the patches that produce it). Linus, for one, feels strongly that this is an important distinction.
I've built a Git datastore that works in much the same way. I constructed it to interface with Rails in a manner as close to ActiveRecord as I could. I actually ended up basing it on the CouchRest module for interfacing Rails and CouchDB, so there are a lot of similarities.
As soon as I find some time I'll do a full write up for anyone that's interested.
Last night I was thinking about how much I hate existing file managers. The metaphor was great when you had 2 meg hard drives, a few folders and maybe a couple hundred files tops, but now it falls apart. There was a project called lifestreams at Yale ( http://cs-www.cs.yale.edu/homes/freeman/lifestreams.html ) that had some interesting ideas about letting you see your documents as a versioned timeline. Using a DVCS-type system as a backend would get you a lot of that plus more: you could explore different ideas for files on different branches. I do however want more...
I want usable metadata like the BeFS had, where arbitrary metadata key/value pairs could be attached to a 'file'. For example, contacts in the BeFS were basically empty files with metadata attached. That metadata included name, address etc. Any application could use the data, augment it etc. The file manager ( tracker ) could query the data to create live 'searches' that looked just like folders. You could add new types and what not. Very powerful when the filesystem is a database. BeOS ( and now haiku ) kept the traditional file manager around as well. Others have approached this idea, but haven't really gone after it.
What I really wanted to see was something like a document-centric database like CouchDB, backed with DVCS-like features, that operates just like a regular filesystem: I mount it, I can drag and drop content into it, save into it, make it look like a 'normal' fs to existing applications, but all that path info etc is just saved searches on metadata that group information together, and you can tag those searches with metadata themselves.
Running git on top of file-based storage would also be an interesting project. There is a proprietary commercial implementation of a similar idea from Caringo called CFS: http://www.caringo.com/products_cfs.html. An open-source version based on git would be nice.
Can Git be made to handle binary files as nicely as [Dropbox](http://getdropbox.com/) does?
On [one](https://www.getdropbox.com/tour#3) of the pages from their tour, it shows how if you save a 10 MB PSD to Dropbox twice, it shows you a table with a row for the original file and another row for '400 extra bytes' or whatever.
> Dropbox is also smart with how it tracks changes to files. Every time you make a change, Dropbox only transfers the piece of the file that changed (also known as block-level or delta sync), making it easy to work with big files like Photoshop or Powerpoint documents.
First off, the "block-level" sync is exactly what rsync is for. You'll probably have better results using that. (http://www.samba.org/rsync/) You could track a list of local file paths in git, but sync the binaries themselves with rsync. (They complement each other well.)
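For anyone wondering what "only transfers the piece that changed" means in practice, here is a deliberately simplified sketch. It compares fixed-size blocks at fixed offsets, which is not rsync's actual algorithm (rsync uses a rolling checksum so it can cope with insertions that shift the data). `BLOCK_SIZE`, `block_digests` and `changed_blocks` are hypothetical names chosen for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumption: an arbitrary block size for illustration


def block_digests(data: bytes, block_size: int = BLOCK_SIZE):
    # One digest per fixed-size block of the file.
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE):
    # Indexes of blocks in `new` that are absent from or differ in `old`.
    # Only these blocks would need to be sent over the wire.
    old_digests = block_digests(old, block_size)
    new_digests = block_digests(new, block_size)
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or old_digests[i] != digest
    ]
```

With this scheme, flipping one byte in a large file marks a single block as changed, but inserting a byte near the start shifts every later block and marks them all, which is the case a rolling checksum is meant to handle.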
Tracking changes in binary files (which cannot be merged in any reasonably generic fashion) is a fundamentally different issue than tracking changes in text, particularly source code. Git is designed to do the latter. While you can use it to track changes in binaries, merging doesn't make sense anymore, and hashing / scanning big binary files for changes is significantly slower. (A bunch of images generally won't matter, but I wouldn't use it to track, say, video, or large database dumps.)
Quite interesting. On a related theme, a lot of people have begun using VCSs (especially git, but sometimes also svn) as the data-storage backend for their applications, where they would have used an RDBMS a few years ago, e.g. [Jekyll](http://github.com/mojombo/jekyll/tree/master).
Git is really nifty. What I like about it is that, when you get down to it, it's amazingly fast. There are many reported cases of people holding their home directories as Git repos on a day-to-day basis.
People store their home dir with hg too (I do, many coworkers do), and before the DVCS era people used svn and cvs. This is far from a git-unique thing.
Indeed. I've used svn, monotone, hg, and git for this (moving from each to the next over the last three years or so), and they all worked fine. In my experience, the special features of each have relatively little significance for typical operations on home directories.
I have often heard of people putting their /etc in cvs/svn, but I never bothered because of having to set up a repo. That is, until I had git, at which point it became just a git init.
Right. Monotone (an earlier DVCS, inspiration for git), while a really nice system technically, expected you to set up keys for authorizing changes to the repository before you could use it. It was a little thing, but reducing setup to just "git init" (or "hg init") makes for less friction to putting stuff in VC just because...and then you find out it's good for things you would have never anticipated.
Nice hacking btw. Note that spidermonkey is fairly easy to build on its own. It lives in the Mozilla repository, but it doesn't have any dependencies to the rest of Mozilla.
Don't have time to read it now, but it looks like very interesting thinking, so I will read it later.
I am using bazaar to do daily backups for a webapp that stores serialized data in files. I was thinking a little about syncing it between servers with bazaar if needed, but you seem to go a few steps further. :)
There are some things that I find extremely enticing about Git. First off, there's the speed, which I already mentioned. The second is how it's designed to be decentralized, and my amazement at this probably stems from the fact that this is the first decentralized VCS that I've ever worked with.
I've been using rsync to keep certain directories of my various computers synchronized, but I think I'll go one step further soon enough: for my Linux computers, I'll keep some parts of my home directory in Git in order to maintain an identical user interface / configuration on those machines.