Is Git more than a version control system? Reimplementing CouchDB with Git+Bash... (ordecon.com)
82 points by ivanstojic on April 22, 2009 | 36 comments


I've been waiting for this. Ever since:

1. I learned from Linus that git is a decentralized "content" database that gives identity to different versions of the same content and allows them to be compared and merged. (as opposed to a traditional VCS which is more of a "delta" database)

2. I learned from Damien that CouchDB is a decentralized "document" database that gives identity to different versions of the same document and allows them to be compared and merged.

I've been wondering just what the difference really is.


Git retains its change history and, if I remember correctly, uses the SHA hash to verify that no one changed the history. CouchDB retains the change history only until someone compacts the database (garbage collection). Its revision history is only used as a means of merging out-of-sync documents.

In addition, Git merges individual changes within a document, and when there's a conflict, it resorts to human intervention. CouchDB does not make merges inside a document, and will not do conflict resolution. Instead, using a brief set of rules, it picks one version of the document over another.
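To make "picks one version over another" concrete, here is a minimal sketch of that kind of rule set. It is modeled on my understanding of CouchDB's winner selection (prefer non-deleted revisions, then the longer revision history, then the higher revision id), so treat the exact ordering as an assumption. The function name `pick_winner` and the dict layout are hypothetical.

```python
def pick_winner(revisions):
    """Pick one conflicting revision deterministically; nothing is merged."""
    def sort_key(rev):
        # A revision id is assumed to look like "<generation>-<digest>".
        generation, _, digest = rev["_rev"].partition("-")
        # Assumed ordering: live beats deleted, then the longer history,
        # then the lexically higher digest as a tie-breaker.
        return (not rev.get("_deleted", False), int(generation), digest)

    return max(revisions, key=sort_key)


conflicts = [
    {"_rev": "2-abc", "title": "edited on laptop"},
    {"_rev": "2-abd", "title": "edited on server"},
]
winner = pick_winner(conflicts)
```

Because every replica applies the same rule to the same set of revisions, they all arrive at the same winner without talking to each other. The losing revision is simply not chosen; its content is never combined with the winner's.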


Thanks for that, but couldn't you theoretically very easily script git to automatically choose based on prior requirements?


Of course you could. I guess one of the more important points made somewhat subtly by this hack is that you can easily fiddle with the functionality of a simple yet flexible piece of software (git). That's what the Unix philosophy preaches: you start by taking lots of simple components that are flexible and do not make assumptions about their use, and glue them together into bigger pieces.

Large homogeneous systems that only provide one functionality are hard if not impossible to extend to make them do just what you want. Unless you want what they do in the first place, which makes the whole customization thing pointless ;-)

But that is self evident, or at least should be.


Is there a way to make Git choose one over the other rather than merging? --mine or --yours or something?


There's a git merge strategy called "ours" (git merge -s ours) which, as you might guess, does a quasi-merge that always results in whatever you have in your HEAD.


Well, it's simple: different command line switches :-)


I hope nobody minds this butchery, but I really wanted to finally learn some more about Git and refresh my Bash skills.

It's true that the road to hell is paved with good intentions :-)


Au contraire; congratulations on a great hack.

I think a lot of us have had this one cooking in our minds for a while and it's great to see that somebody got it Done.

Bravo.


Much obliged for your comment. It's comments like this one, and others I have received, that drive me to do more hacking :-)


Not really a new idea to use a VCS as a document database, other folks have back-ended Wikis and document management systems with SVN, for example. But a nice story regardless, and another nudge for me to really take a closer look at git.

The first distributed VCS I heard of was darcs. I have not used it either, but from what I've read about the two, git has some real advantages in speed and robustness.


> Not really a new idea to use a VCS as a document database, other folks have back-ended Wikis and document management systems with SVN, for example.

It's not the same thing. Using SVN as a document database is equivalent to using the filesystem as a document database. The only thing extra you get with SVN is a revision history.

Git actually is a content database. Its version control capabilities are built on top of that.

When you do a checkout on a traditional VCS you're telling the VCS to apply patch set x to the file system. When you do a git checkout you're telling git to load content x from the database into the filesystem.
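A minimal sketch of what "content database" means here: objects are stored under a key derived from their own bytes, and a checkout is just a lookup by that key. The blob id is computed the way Git computes it (SHA-1 over a `blob <size>\0` header plus the content); the `ContentStore` class itself is a hypothetical in-memory toy, not Git's actual storage code.

```python
import hashlib


class ContentStore:
    """Toy in-memory object store keyed by a Git-style blob id."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # Git hashes a small header together with the content.
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        object_id = hashlib.sha1(header + content).hexdigest()
        self._objects[object_id] = content
        return object_id

    def get(self, object_id: str) -> bytes:
        # "Checkout" in this model: load content by id, no patches applied.
        return self._objects[object_id]


store = ContentStore()
first_id = store.put(b"hello\n")
second_id = store.put(b"hello\n")  # identical content maps to the same id
```

Identical content always lands under the same id, so storing it twice costs nothing extra, and the id doubles as an integrity check on what comes back out.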


Darcs is just different. Also, when the Git buzz started, the Darcs people were wondering why the Git developers didn't just contribute to Darcs, since their ideas are almost the same (except for some things like the merge algorithm)... then they realized that some people just don't program in Haskell :(


Git and Darcs have very different underlying models. Most importantly, Git is content-centric (diffs are generated only as byproducts) while Darcs is patch-centric (any given "version" is just a sum of all the patches that produce it). Linus, for one, feels strongly that this is an important distinction.
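A toy illustration of the two models, under heavy simplification: in the content-centric version whole snapshots are stored and a diff is computed only when someone asks for it, while in the patch-centric version only patches are stored and a version exists as the result of applying them in order. The names `snapshots`, `patches`, `diff_versions` and `materialize` are hypothetical, and the patch format is invented for this sketch; it is not how Darcs actually represents patches.

```python
import difflib

# Content-centric (Git-like): whole snapshots are stored, keyed by version.
snapshots = {
    "v1": ["a", "b"],
    "v2": ["a", "b", "c"],
}


def diff_versions(old, new):
    """Derive a diff on demand from two stored snapshots."""
    return list(difflib.unified_diff(
        snapshots[old], snapshots[new],
        fromfile=old, tofile=new, lineterm=""))


# Patch-centric (Darcs-like, heavily simplified): only patches are stored.
patches = [("insert", 0, "a"), ("insert", 1, "b"), ("insert", 2, "c")]


def materialize(patch_list):
    """Rebuild a version by applying every patch in order."""
    lines = []
    for op, index, text in patch_list:
        if op == "insert":
            lines.insert(index, text)
        elif op == "delete":
            del lines[index]
    return lines
```

Both sides can answer "what is in v2?" and "what changed between v1 and v2?", but each treats one of those questions as primary and derives the other.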


I've built a Git datastore that works in much the same way. I constructed it to interface with Rails in a manner as close to ActiveRecord as I could. I actually ended up basing it on the CouchRest module for interfacing Rails and CouchDB, so there are a lot of similarities.

As soon as I find some time I'll do a full write up for anyone that's interested.


Funny that this should appear this morning.

Last night I was thinking about how much I hate existing file managers. The metaphor was great when you had 2 MB hard drives, a few folders, and maybe a couple hundred files tops, but now it falls apart. There was a project called Lifestreams at Yale ( http://cs-www.cs.yale.edu/homes/freeman/lifestreams.html ) that had some interesting ideas about allowing you to see your documents as a versioned timeline. Using a DVCS-type system as a backend would get you a lot of that plus more: you could explore different ideas for files on different branches. I do however want more...

I want usable metadata like the BeFS had, where arbitrary metadata key/value pairs could be attached to a 'file'. For example, contacts in the BeFS were basically empty files with metadata attached. That metadata included name, address etc. Any application could use the data, augment it etc. The file manager (Tracker) could query the data to create live 'searches' that looked just like folders. You could add new types and whatnot. Very powerful when the filesystem is a database. BeOS (and now Haiku) kept the traditional file manager around as well. Others have approached this idea, but haven't really gone after it.

What I really wanted to see was something like a document-centric database like CouchDB, backed with DVCS-like features, that operates just like a regular filesystem: I mount it, I can drag and drop content into it, save into it, and it looks like a 'normal' fs to existing applications. But all that path info etc. is just saved searches on metadata that group information together, and you can tag those searches with metadata themselves.


An additional thought... then filesystem backups just become a push/pull scenario between different instances. Man, I really like this general idea.


This has been on my mind for a while: Does anyone know if DropBox is developed on top of an existing source control system?


Running git on top of file-based storage would also be an interesting project. There is a proprietary commercial implementation of a similar idea from Caringo called CFS ( http://www.caringo.com/products_cfs.html ). An open source version based on git would be nice.


Interesting question that I've been wondering about too. I had a prototype of something similar running in 2006, built on top of git.


I always assumed it ran on SVN.


Can Git be made to handle binary files as nicely as [Dropbox](http://getdropbox.com/) does?

On [one](https://www.getdropbox.com/tour#3) of the pages from their tour, it shows how if you save a 10 MB PSD to Dropbox twice, it shows you a table with a row for the original file and another row for '400 extra bytes' or whatever.

> Dropbox is also smart with how it tracks changes to files. Every time you make a change, Dropbox only transfers the piece of the file that changed (also known as block-level or delta sync), making it easy to work with big files like Photoshop or Powerpoint documents.


First off, the "block-level" sync is exactly what rsync ( http://www.samba.org/rsync/ ) is for. You'll probably have better results using that. You could track a list of local file paths in git, but sync the binaries themselves with rsync. (They complement each other well.)
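A rough sketch of the block-level idea, assuming fixed-size blocks compared at fixed offsets. This is a simplification: rsync itself uses a rolling checksum so that it can also match blocks that have shifted position, and nothing here reflects how Dropbox actually implements it. The names `block_digests` and `changed_blocks` and the tiny `BLOCK_SIZE` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, just to keep the example readable


def block_digests(data: bytes, block_size: int = BLOCK_SIZE):
    """Hash each fixed-size block of the data."""
    return [
        hashlib.sha1(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE):
    """Return indexes of blocks in `new` that differ from, or extend, `old`."""
    old_digests = block_digests(old, block_size)
    new_digests = block_digests(new, block_size)
    return [
        index
        for index, digest in enumerate(new_digests)
        if index >= len(old_digests) or old_digests[index] != digest
    ]


to_send = changed_blocks(b"aaaabbbbcccc", b"aaaaXbbbcccc")
```

Only the blocks listed in `to_send` would need to travel over the wire. With fixed offsets, inserting a single byte near the start of a file shifts every later block and makes them all look changed, which is the problem the rolling checksum addresses.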

Tracking changes in binary files (which cannot be merged in any reasonably generic fashion) is a fundamentally different issue than tracking changes in text, particularly source code. Git is designed to do the latter. While you can use it to track changes in binaries, merging doesn't make sense anymore, and hashing / scanning big binary files for changes is significantly slower. (A bunch of images generally won't matter, but I wouldn't use it to track, say, video, or large database dumps.)


Quite interesting. On a related theme, a lot of people have begun using VCSs (especially git, but also sometimes svn) as the data-storage backend for their applications, where they would have used an RDBMS a few years ago. E.g. [Jekyll](http://github.com/mojombo/jekyll/tree/master) etc.


Git is really nifty. What I like about it is that, when you get down to it, it's amazingly fast. There are many reported cases of people holding their home directories as Git repos on a day-to-day basis.

It's just amazing what a good VCS can let you do.


People store their home dir with hg too (I do, many coworkers do), and before the DVCS era people used svn and cvs. This is far from a git-unique thing.


Indeed. I've used svn, monotone, hg, and git for this (moving from each to the next over the last three years or so), and they all worked fine. In my experience, the special features of each have relatively little significance for typical operations on home directories.

It's worth picking any and just doing it, though.


I have often heard of people putting their /etc in cvs/svn, but I never bothered because I'd have to set up a repo. That is, until I had git, at which point it became just a git init.


Right. Monotone (an earlier DVCS, inspiration for git), while a really nice system technically, expected you to set up keys for authorizing changes to the repository before you could use it*. It was a little thing, but reducing setup to just "git init" (or "hg init") makes for less friction in putting stuff in VC just because... and then you find out it's good for things you would have never anticipated.

* One style difference between the two. See Graydon Hoare's comment here (http://www.mail-archive.com/monotone-devel@nongnu.org/msg080...).


Nice hacking btw. Note that SpiderMonkey is fairly easy to build on its own. It lives in the Mozilla repository, but it doesn't have any dependencies on the rest of Mozilla.


Yup, I know :-) I'm kind of a lazy bum so I wanted to skip one step. I figure there are more worthy things to hack than compiling spidermonkey.


So, um, has anybody tried using this?

Don't have time to set it up at the moment, but I'd been wondering exactly the same thing.


Don't have time to read it now, but it looks like very interesting thinking, so I will read it later.

I am using Bazaar to do daily backups for a webapp that stores serialized data in files. I was thinking a little about syncing it between servers with Bazaar if needed, but you seem to go a few steps further. :)


There are some things that I find extremely enticing about Git. First off, there's the speed, which I already mentioned. The second is how it's designed to be decentralized, and my amazement at this probably stems from the fact that this is the first decentralized VCS that I've ever worked with.

I've been using rsync to keep certain directories of my various computers synchronized, but I think I'll go one step further soon enough: for my Linux computers, I'll keep some parts of my home directory in Git in order to maintain an identical user interface / configuration on those machines.


Thanks for reminding me how paltry my Bash skills are. Yeah, thanks a lot... :)


You bastards killed my webhost's MySQL instance :-)



