Git as a Storage (bronevichok.ru)
134 points by todsacerdoti on Oct 8, 2021 | 64 comments



I have a technical question that I'm not at all poised to answer, that might be stupid like all questions not in one's domain:

I recently discovered the joy that is ZFS and everything that comes with it. I understand that the technical underpinnings of git are actually extremely different (and mathematical) _but_ just how far is a ZFS snapshot from a git commit really? It seems like the gap between the two might not need a huge bridge. Could a copy-on-write filesystem benefit from more metadata that would come from being implemented in a more git-like way?


Conceptually, the two things are very much related and a birds-eye view shows a lot of similarities. But when you get into the weeds, there are some significant differences. git is optimized to store a great many historic states of files with minor differences between consecutive ones, and it assumes that these are essentially static, immutable snapshots. A COW file system that allows for snapshots is optimized more for allowing mutation of these snapshots (i.e. updating files one way in one snapshot and another way in another one). This, combined with the additional housekeeping required for a file system (disk block allocation, etc. - the actual core features) makes the implementations of the two things very different.


Very close. Actually companies such as https://postgres.ai/ use ZFS storage to provide git-like features on top of Postgres: Using copy-on-write on the underlying ZFS, you can "fork" a new branch of your DB with all the data, instantly. Then both branches can live their lives independently.

But I don't think ZFS has the equivalent of git merge though.
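
The "fork" half is just a snapshot plus a clone; a minimal sketch, with hypothetical pool/dataset names:

    $ zfs snapshot tank/pgdata@fork-point
    $ zfs clone tank/pgdata@fork-point tank/pgdata-branch      # instant, shares unmodified blocks
    $ zfs set mountpoint=/var/lib/pg-branch tank/pgdata-branch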


I've been doing this, but with BTRFS as a trial. Is there anyone who can suggest a benefit of using ZFS over BTRFS or the other way around?


I trust ZFS for providing RAID-like features via an HBA, just as much as using ZFS as a volume manager on top of a hardware RAID. My experience with BTRFS for RAID was disastrous.

I really like the ability to use zfs send/receive over ssh for offsite backups.
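
Roughly like this, with hypothetical pool and host names (the first send is a full stream, later ones are incremental):

    $ zfs snapshot tank/data@2021-10-08
    $ zfs send tank/data@2021-10-08 | ssh backup-host zfs receive -u backup/data
    $ zfs send -i @2021-10-08 tank/data@2021-10-15 | ssh backup-host zfs receive backup/data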

I'll admit, I haven't kept up with BTRFS features after abandoning it, so some of the features may have improved.


Generally BTRFS seems to be fine until it breaks, and then it's just painful and difficult to recover from (if at all possible).

I'm sure it'll get more stable in the future, but right now I wouldn't trust it with any production data the way I would trust ZFS.

With ZFS on Linux being a thing now, I'd choose that over BTRFS every time.


ZFS has been in production use for about 20 years and has seen a lot of improvements in that time; it is well tested and well understood.

BTRFS has never really achieved "production" status in most people's eyes (at least, that I've seen), and RedHat removed support for it completely (not that they support ZFS either).

ZFS is also very straightforward once you learn how it works, and that takes very little time. The commands and the documentation (man pages) are thorough and detailed. Conversely, trying to figure out how and why BTRFS does what it does has been a huge challenge for me, since nothing seems to be as straightforward as it could be. I'm not sure why that is.

Development on BTRFS is ongoing, but it is starting to feel as though it's never going to actually finish its core features, let alone add quality-of-life improvements. As an example of what I mean: I run a Gitlab server, which divides its data into tons of directories, some of which are huge and some of which are not, but many of which have different use cases: a Postgres database, large binary files, small text files, temporary files, etc. With ZFS, I set up my storage pool something like this:

gitlab/postgres

gitlab/git-data

gitlab/shared/lfs-data

gitlab/backups

Now everything is divided up and I can specify different caching and block sizes on each one depending on the workload. When I'm going to do an upgrade I can do an atomic recursive snapshot on gitlab/ and I get snapshots on everything.
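
For the curious, that setup looks roughly like the following (the property values are illustrative, not recommendations):

    $ zfs create -p -o recordsize=8K gitlab/postgres
    $ zfs create -p -o recordsize=1M -o compression=lz4 gitlab/shared/lfs-data
    $ zfs create -p -o primarycache=metadata gitlab/backups
    $ zfs snapshot -r gitlab@pre-upgrade    # one atomic, recursive snapshot of the whole tree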

BTRFS, as far as I can tell, doesn't let you change as many fine-grained things per-storage-space, and it doesn't have atomic recursive snapshots (and touts this as a feature). I'm not sure if it supports a similar feature to zvols, where you can create a block device using the ZFS pool storage (in case you need an ext4 file system but you want to be able to snapshot it, or similar).

<anecdote> I have never once had a single issue with ZFS and data quality, with the exception of cases where underlying storage has failed. Meanwhile, I've had BTRFS lose data every single time I've tried to use it, often within days. Obviously lots of other people haven't had that issue, but suffice to say that personally, I don't trust it. </anecdote>

Meanwhile...

ZFS doesn't support reflinks like XFS and BTRFS do, so you can't do `cp -R --reflink=always somedir/ otherdir/` and just get a reflink copy (i.e. copy-on-write per-file). On XFS, and presumably BTRFS, I can do this and get a "copy" of a 30-50 GB git repository in less than a second, which takes up no extra space until I start modifying files. On ZFS, I have to do `cp -R somedir/ otherdir/` and it copies each file individually, reading the data from each file and then writing the data to the copy of the file.

ZFS also doesn't come as part of the kernel, so you can run into issues where you upgrade the kernel but for whatever reason ZFS doesn't upgrade (maybe the installed version of ZFS doesn't build against your new kernel version) and then you reboot and your ZFS pool is gone.

You also "can't" boot from ZFS, which is to say you can but if you do something like upgrade the ZFS kernel modules and then update your ZFS pool with features that Grub doesn't understand, you now cannot boot the system until you fix it by booting into a rescue image and updating the system with a new version of Grub. Ask me how I found that out.

In the end, my experience has been that ZFS is polished, streamlined, works great out of the box with no tuning necessary, and is extremely flexible as far as doing whatever it is I want to do with it. I see no real reason not to use ZFS, honestly, except for the "hassle" of installing/updating it yourself, and there's an Ubuntu PPA from jonathanf which provides very up-to-date ZFS packages so that you can get access to bug fixes or new features (filesystem features or tooling features) very quickly, with zero effort on your part.


Thanks for the detailed comment!


Naive question, but what is the advantage compared to a classical DB dump? Faster?


I think the keyword was "instantly."


Is it correct that then the original DB and the snapshotted DB share those blocks on the file system which are unmodified?

Assume 1 row per block: Original DB "A" has 2 rows, a snapshot "B" is created, "B" deletes one row and adds a new one.

Is it true that the row which "B" took over from "A" and left unmodified resides on the same block for "A" and "B", so that if the block gets corrupted, both databases will have to deal with that corrupt row?


Yes, that's one of the core parts of copy-on-write.

It shouldn't matter if you have a reasonable setup. If you depend on other files on the drive to continue working after blocks have started to go corrupt, that's not a good system.


Yes, faster. In my experience it took a couple of seconds at most for a copy of our production database, while a dump took several hours.


I think you are correct. It would not be a huge stretch to turn a snapshotting file system into a VCS. https://en.wikipedia.org/wiki/Versioning_file_system

The IBM/Rational Clearcase version control system is an example of building a VCS on top of a versioning file system (MVFS), though MVFS uses an underlying database instead of a copy-on-write snapshot mechanism. https://www.ibm.com/support/pages/about-multiversion-file-sy...


It would be nice if ZFS snapshots were more flexible. And you could say "like git" when talking about the user experience. But it would not be like git in terms of implementation. Git's implementation is not really copy-on-write. It's deduplication.

I'd say the git method is actually pretty low in metadata, and the way you'd improve ZFS snapshots doesn't involve making them more like git.

If you did get that huge amount of work done, you could then approximate git with snapshots alone. Right now, you'd probably want snapshots and dedup to work together to approximate git using ZFS.


Doesn't the zfs diff command mostly cover it? You have snapshots, diffs, and clones, which are basically equivalent to commits, diffs, and branches. You're missing commit logs and that's it, right?
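
E.g., with illustrative dataset and snapshot names:

    $ zfs diff tank/data@snap1 tank/data@snap2
    # prints one line per changed path, flagged M (modified), + (created), - (removed), or R (renamed)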


If you introduce a new file onto multiple ZFS 'branches', there is no way to have it stored copy-on-write.

But copying a file from one 'branch' to another is the only way to emulate a cherrypick or merge.

So after a while of active use with many branches, you're going to have a lot of redundant copies of files all over. You're no longer making proper use of snapshots, and it becomes less efficient than having a working directory and a directory full of commits that hard link to each other.


Good point, but I bet this could be changed in ZFS.


At Google, people have built both, and we use a version control system on top of a snapshotting filesystem. The snapshotting is for never losing code/state on your machines, and the version control system is for interfacing with others (code review, merging, etc). While you could use one system for both, having both layered makes it easier to tailor each to its specific workflow.


Came here just to mention Btrfs which does the same as ZFS in the sense that it is also COW by default.


Kind of an aside, but I've been toying with a simple functional language based on Joy and when it came to exposing the filesystem it seemed too fraught with impurity, so instead I'm just using git as the data storage system. Instead of strings or blobs you have handles that are essentially three-tuples of (git object hash, offset, length). It's early days yet, but so far the approach seems promising. (In re: string literals, well, your literal is in a source file, and your source file is in git, so each literal has its tuple already, eh?)


How feasible is it to store raw content in the Git content-addressable-store (CAS)? Git blobs are Zlib compressed.

I'd like to be able to store audio files uncompressed, so that they could be read directly from the CAS, rather than having to be expanded out into a checkout directory.


IIRC, a git blob has the size of the data encoded in the first 4 bytes of the file, and the data itself appended to it. It could be stored uncompressed, but I don't think there's anything in the git plumbing layer that could deal with it directly.

That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.
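
For example, for a blob containing "hello world\n":

    $ git cat-file blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
    hello world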


The header for a blob file is "blob", a space, the length of the content as ASCII integer representation, then a null byte.

    $ echo "hello world" > HELLO.txt
    $ git add HELLO.txt 
    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
    > zpipe -d | \
    > hexdump -e '"|"24/1 "%_p" "|\n"'
    |blob 12.hello world.|
    $
The header and the content get concatenated together, and the whole thing gets Zlib compressed. The SHA1 is calculated from the header-plus-content before it gets Zlib compressed.

    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \                  
    > zpipe -d | \
    > shasum
    3b18e512dba79e4c8300dd08aeb37f8e728b8dad  -
    $
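git hash-object performs that same header-plus-content hashing, so it makes a handy cross-check:

    $ git hash-object HELLO.txt
    3b18e512dba79e4c8300dd08aeb37f8e728b8dad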
What I would like to do is record an audio file (e.g. LPCM BWF), take its SHA1 and store it in the CAS as raw content, then reference it somehow from a Git commit. That way it will be part of the history and will travel with `push` and `clone`, won't get gc'd, etc.

> That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.

That's a neat suggestion! However, I don't see how it would be compatible with random access, which is important for my application.
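
For the "reference it somehow from a Git commit" part, the plumbing can already attach an arbitrary blob to a tree; it just won't sit raw on disk, since it gets zlib-compressed like every other object. A sketch, with a hypothetical file name:

    $ sha=$(git hash-object -w take-01.wav)    # writes the blob, prints its SHA1
    $ git update-index --add --cacheinfo 100644,"$sha",audio/take-01.wav
    $ git commit -m "add take 01"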


Basically that's what Git-LFS does; it takes the SHA of the file, stores it in the git version of the file, and then stores the contents next to it. It's all transparent and works pretty well.
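
If memory serves, the pointer file that git itself tracks is just a few lines of text, roughly:

    version https://git-lfs.github.com/spec/v1
    oid sha256:<sha256-of-the-audio-file>
    size <size-in-bytes>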


Hmm, but the point of Git-LFS is to store large files outside the CAS so that they don't burden operations like `clone`. And Git-LFS does lots of magic.

Maybe to achieve what I've laid out, I really would need to write a Git extension a la Git-LFS. But then vanilla Git wouldn't be able to make full use of it, which undermines the purpose of using Git in the first place.

As an alternative, maybe I just commit the darn audio files to the repo.

• In relative terms, audio files grow smaller every year.

• Large repository size isn't as critical for a music composition tool as it is for perpetually maintained software source code.

• I'm imagining a tool to prune edit history which would consolidate commits and potentially garbage collect audio files that become unreferenced.

I wish there was a way in vanilla Git to just associate a CAS object containing arbitrary bytes with a commit object, though.


You can set git-lfs to automatically check out your LFS objects on clone; that's a setting. Yes, they're outside the repo - but not far off.

It's the same magic you want to do; really and truly, there's magic there, but it's a pretty thin and well-defined layer of magic.


I agree that decoupling from Git has its benefits, and I've built a tool[1] that seems to meet some of your needs above. The idea is to save binary data in a separate content-addressed store and have Git track references to specific files in said store. If you check it out, I'd be happy to hear what you think!

[1]: https://github.com/kevin-hanselman/dud


What exactly is the issue? Why don't you just use submodules? Or do you want to associate the SHA-1 of commits with files outside git? Ugh, I need sleep.


The core of cat-file.c is quite short; I think you could get the random access you want with minimal effort. Ideally, upstream support for --offset and --count (or whatnot) to git; a lot of people would benefit.

https://github.com/git/git/blob/master/builtin/cat-file.c
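
In the meantime, a crude workaround is to slice the cat-file stream (it still decompresses everything up to the offset; numbers are illustrative):

    $ git cat-file blob <sha> | dd bs=1 skip=44 count=4096 2>/dev/null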

You can absolutely make tools to expand out & load git repos into content stores; how you do that is going to depend on the content store.


I don't know the answer to your question, but take a look at git-annex [0] if you haven't already.

One of the examples it gives is storing a music collection. If I understand correctly, I don't think it automatically compresses every file - or at least gives you the ability to not compress it.

[0]: https://git-annex.branchable.com/


You can't do that, every object is compressed (and then added to pack files with delta-encoding). Even if you did manually write a non-zlib object, things would break the first time Git tried to access that object (for example to show, compare with working copy, gc, or repack).


Thank you, that's very helpful to know! It makes perfect sense, too.

I wonder if I can abuse the pack file format. Mua ha ha. Probably not but learning about Git innards pays dividends even if the experiments don't work out.


I've been toying with the idea of writing a protocol for Git as a "Blockchain" for bank interchange. Require signatures on all commits, include a protocol for how to push commit proposals to other peers for signing, verifying commits before they're merged, etc. No mining, just a distributed transaction log via git.
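
Stock git already covers a fair amount of the signing half; a rough sketch, with an illustrative remote/branch and commit message:

    $ git config commit.gpgsign true                 # sign every commit by default
    $ git commit -S -m "transfer: acct A -> acct B, 100.00"
    $ git verify-commit HEAD                         # peers check the signature
    $ git merge --verify-signatures peer/proposed-transactions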


Merged where? If you have the analog of github in your scheme, then you have a central authority. If you have a central authority, you don't need a blockchain [for banking], not really.

If you assume fully independent git nodes that interact, then there is no authoritative 'branch', 'repo', or anything. Every node will have its own world view (repo and history), and to enforce a 'global consensus' you will need a BFT distributed consensus algorithm on top of your git scheme so all banks can agree on transactions and apply patches (accept transactions).

If you assume partial consensus -- only the interacting banks of a transaction need to be in agreement -- then you still need a consensus for that smaller group to 'merge' their collective actions into a linear transactional narrative.

For most people using 'git', the "distributed transaction log via git" is owned and managed by GitHub, a central authority.


I've been doing something a bit similar with https://github.com/MichaelMure/git-bug/tree/master/entity/da... if you care about digging.

It's a generic distributed data structure in git, with identities, signatures, and conflict merging. At the moment it's used to store bugs, with more to come later.


Isn't this what an oracle mostly does already? Solana may be a good cheap option.


I like this idea; I think you should look further into it. There might be a market for it.


What I really want to see is a blog post on Git as an undo engine.

This is related to the idea of using Git as general storage, in that the undo history can be persisted, and then reconstituted by a new process. The trick would be to make all actions compatible with conversion to and from a commit.


Emacs kind of has this in the form of magit-wip-mode. It doesn't sync with the undo system but it does persist every file save event since the last "real" commit.


I need to look into this.

Just yesterday I lost some work (only maybe 10 minutes worth) when I was updating my org notes. I staged some files, committed them, but made a typo in the commit message. I ended up reverting the commit when I meant to amend. Then I discarded the staged reverted changes and noticed the status said I was still reverting. So I looked up the command to get me out of that state. I ran 'git revert --abort' and it blew away my unstaged changes. Ah well, those versioned backups I have Emacs do are going to save me this time, or so I thought.


If you did actually commit, you should be able to find the changes on the reflog

In magit, it's accessible from the log menu.

Create a branch, reset it to the badly worded commit hash, and then you can try to amend again.
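
In plain git that's roughly (magit wraps the same commands):

    $ git reflog                        # find the hash of the badly worded commit
    $ git checkout -b rescue <hash>
    $ git commit --amend                # fix the message this time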


I was breaking a bunch of unstaged changes into two commits, and the part I lost had not been committed.


I've seen programs that serialize the undo manager, but is there an advantage to using git instead? I suppose it does most of the work for you, but you'll still need to manage actions that can be undone but don't directly modify a file (these are entirely app-dependent, but could be something like changing which document tab is selected).


The advantage is that Git's data structure is an open design, with existing tools able to introspect and possibly manipulate it, with potential users and contributors able to leverage their existing knowledge, and with documents/projects still being parseable decades from now.

I have dreams of implementing a music composition tool / audio editor with a line-oriented edit-decision-list (EDL) text file format where changes could map coherently onto git history. Ideally, that EDL format would be an open standard as well: I've contemplated the Pure Data file format and AES31 as possible candidates. This is just at the conceptual stage, though.


One problem with git is that it assumes you have some sort of "diff" utility for before and after snapshots. For some types of things, that's not feasible - you can't easily diff two images and work out what filter was used. Sometimes you instead want to create the diff, and generate the "after" snapshot from it. You could emulate this by stowing the information somewhere hidden and writing a fake "diff" that simply retrieves it - but saving the edit tree and reconstituting it is supposed to be git's job! You end up reimplementing bits of git, just to make git work.


I suppose the "diff" limitation would constrain the potential use cases. For my purposes, I don't think that would be a problem, because the project format would consist of two types of files:

• PCM audio files which are captured once and then never modified.

• Line oriented text files for which the traditional "diff" functionality suffices.


one of these years i'll have something fun to show you


There are a few git-inspired version-controlled databases out there if performance becomes an issue. Dolt and TerminusDB are the most prominent.

https://github.com/terminusdb/terminusdb https://github.com/dolthub/dolt


I've recently had a similar idea for when you want to track metrics of a Git repository over time (code size, line count, etc).

I would love to create a script that takes a measurement of the current tree, then run a tool that runs my script at ~every commit so I can draw a graph of how the metric changes over time.

It's a bit tricky: if you change the script, you need to re-run the analysis at every commit. It starts looking a little bit like a build system, but integrated over time.

I've thought of calling this GitReduce or similar, since it has some similarity to MapReduce: first a "map" step runs at every commit, then the "reduce" step combines all of the individual outputs into a single graph or whatever.

Ideally Git itself could be the only storage engine, so you can trivially serve the results from GitHub.


git provides a couple of options for running a script against the code for every commit


Even without built-in options, writing a for-loop over the result of git log or rev-list is more or less two lines of scripting. Same with a walk over rev-parse HEAD^.
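
Roughly, assuming a measure.sh that prints one line of metrics per tree (best run in a scratch clone or worktree):

    $ git rev-list --reverse HEAD | while read sha; do
    >   git checkout -q "$sha" && echo "$sha $(./measure.sh)"
    > done > metrics.txt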

Then you also have git bisect but that's more for finding when some metric started to show anomalies.


Would that be git for-each-ref [1] or something different?

[1]: https://git-scm.com/docs/git-for-each-ref


Commits are not refs, only the head of a branch (or tag) is a ref.


I thought it was a neat article; I assumed it was talking about git lfs.

It would be neat if github could store all its data in git, similar to fossil scm. But I suppose microsoft would not want to lose the lock-in.


"It would be neat if github could store all its data in git, similar to fossil scm."

Yes, that would be very nice - it is unfortunate that you have to make API calls (over http) to get things like issues ...

I think you can get the wiki with plain old 'git' ? I forget ...


> I think you can get the wiki with plain old 'git' ? I forget ...

This is correct. The wiki for a repo is accessible as a separate repository named with a suffix of “.wiki”.

So if user foo has a repo bar with an associated wiki, and the repo URL is https://github.com/foo/bar then you can clone the repo and the wiki respectively over SSH by:

    git clone git@github.com:foo/bar.git
and

    git clone git@github.com:foo/bar.wiki.git
I wish they’d do the same for all other repository metadata, including issues, repository description, etc.


I believe this is true of gitlab and other providers as well, correct ?

That is, you need API calls to get things like issues.

Is there a single tool that will handle downloading (and the associated API calls) from all of the major providers ? Or is each API tool specifically for either github or gitlab or sr.ht or whatever ?


Every issue system has its own API and today there's no standard for interchange. It would be interesting to see an attempt to build a reusable standard, but I don't know what sort of standards agency exists with the guts to try something like that. I don't envy the political battle it would entail, and having seen some of the horrors of bespoke Jira and TFS configurations, I mostly suspect such a standard would either be too minimalist and disappoint too many people, or too maximalist and impossible to build.


Yes, I understand each providers standard is different - and I agree that would be a real mess to wrangle.

What I am wondering is: are there any tools that use these APIs and have built-in support for multiple provider APIs? Or is every tool that helps you manage or download issues, etc. just built for a particular provider?

Thanks.


git-bug, the one mentioned in the article here, has some documentation on its README of how well its importer/exporter tools support Github, Gitlab, Jira, and Launchpad: https://github.com/MichaelMure/git-bug

Most of the other such tools I've seen barely have the resources to import/export a single such API. git-issue only has Github import it looks like. https://github.com/dspinellis/git-issue

There's perceval which is designed to be a generic archival tool and supports lots of APIs, but only dumps them to source-specific formats and would still need a lot of work if you tried to use issues from different APIs together: https://github.com/chaoss/grimoirelab-perceval


> I thought it was a neat article

I think the article talks about the "What" part of the problem, but the actual code is much more interesting in the "How" sense.

Like the git-ref stuff makes sense as you read the code

https://github.com/ligurio/git-test/blob/master/bin/git-test...

There was a similar set of additions to svn in the past with "svn propedit" in the workflows I used in a previous workplace.

It was not pretty, because it was like embedding JIRA into svn - but it meant machines could flip from state to state with commits during build+test and restart from that point without an independent DB to track the "current state", and people with commit access could nudge a stuck build along without losing "who did what".


All commit data is stored in git, and the beauty of git is that, outside of platform metadata, you can add a new remote and never be locked in.


If you are using Restic Backup, aren't you coming close to what's being recommended here?



