Seafile (a file sync and storage system) takes inspiration from git for how it stores files (internally there are repositories, branches, and commits). However, the files are not stored directly:
> A file is further divided into blocks with variable lengths. We use Content Defined Chunking algorithm to divide file into blocks.
> This mechanism makes it possible to deduplicate data between different versions of frequently updated files, improving storage efficiency. It also enables transferring data to/from multiple servers in parallel.
I use it on an old PC without issue. One drawback: since the files are not stored in the clear, I need a backup in case the Seafile repositories ever get corrupted (which has never happened to me).

* https://manual.seafile.com/develop/data_model/
* https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf
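For a concrete feel of the content-defined chunking described in the quote above, here is a minimal Python sketch of the general technique (a Gear/FastCDC-style rolling hash; the table, mask, hash, and block-size limits are illustrative and not Seafile's actual implementation):

```python
import hashlib
import random

# Toy content-defined chunker (Gear-style rolling hash). The parameters
# below are illustrative, not Seafile's.
random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # fixed pseudo-random byte table
MASK = (1 << 13) - 1                                 # ~8 KiB average block size
MIN_BLOCK, MAX_BLOCK = 2 * 1024, 64 * 1024

def cdc_blocks(data: bytes):
    """Yield (sha1_hex, block) pairs with boundaries chosen by the content."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # older bytes shift out of the hash
        size = i - start + 1
        if size >= MIN_BLOCK and ((h & MASK) == 0 or size >= MAX_BLOCK):
            block = data[start:i + 1]
            yield hashlib.sha1(block).hexdigest(), block
            start, h = i + 1, 0
    if start < len(data):
        block = data[start:]
        yield hashlib.sha1(block).hexdigest(), block

# Blocks are stored by their hash, so identical blocks across file versions
# (or across files) are kept only once.
```

Because block boundaries depend only on nearby bytes, an edit near the start of a file only changes the blocks around the edit; all the unchanged blocks keep their IDs and deduplicate against the previous version.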
Seafile is fantastic and I'm surprised I don't see more discussion about it around here. I've been running it on a VPS with MinIO as my object storage for about two years now, ~4TB of data and just shy of 100,000 files. It syncs fast, it's stable af, and I "own" all my data. Can't recommend it enough.
How does Seafile's chunking compare to git's packfiles, which can store binary deltas? [0] While git conceptually stores full files, that doesn't mean the actual implementation isn't more efficient.
I did exactly this and use it as a "file system adapter" on my WordPress installation (to handle uploads/media). I tried to submit it as a plugin but they said "no" because I can't use "git" in the name -- and fuck that. So, I'm the only person on earth that does this (AFAIK).
Do you have this plugin available anywhere at all? I can't say it'd fit my workflow or anything without knowing more (so don't go out of your way!), but I'm very intrigued as to how it works.
No, because a branch is a (re-assignable) name for some commit and cannot point at different commits in the same repository at the same time. A "branch but at different commits" simply makes no sense. You can, however, create a worktree with a detached HEAD pointing at any commit. By default it seems to create a branch for each worktree.
You can have multiple worktrees for the same commit, for that matter, as long as they are not on the same branch (so with different branches pointing at the same commit, or detached HEADs).
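A quick way to see this behavior, as a sketch (throwaway repo in a temp directory, plain git CLI driven from Python; paths and names are made up):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True)

base = tempfile.mkdtemp()            # scratch area; everything below is throwaway
repo = os.path.join(base, "repo")
os.makedirs(repo)
git("init", "-q", cwd=repo)
with open(os.path.join(repo, "file.txt"), "w") as f:
    f.write("hello\n")
git("add", "file.txt", cwd=repo)
git("-c", "user.name=demo", "-c", "user.email=demo@example.com",
    "commit", "-qm", "first", cwd=repo)

# Three working trees, all at the same commit: the main one, a detached HEAD,
# and a brand-new branch.
git("worktree", "add", "--detach", os.path.join(base, "wt-detached"), "HEAD", cwd=repo)
git("worktree", "add", "-b", "other", os.path.join(base, "wt-branch"), "HEAD", cwd=repo)
git("worktree", "list", cwd=repo)
```

The two `worktree add` calls succeed because one tree is a detached HEAD and the other is a new branch; adding a second worktree on a branch that is already checked out elsewhere is what git refuses (unless you force it).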
It's impractical to list a directory with millions of entries (e.g., commits) that require a lot of work to find. One might want to organize such a pile of things in a way that naturally limits the number of items to list -- i.e., one might want to add paging, so you'd have page 1, page 2, etc., with each page having some small number of items, say, 1000 (e.g., commits).
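As a sketch of what that paging could look like for a virtual commit directory (the layout, names, and page size here are made up, not anything the submitted filesystem actually does):

```python
PAGE_SIZE = 1000  # entries per virtual "page" directory (arbitrary choice)

def paged_path(index, commit_id):
    """Place the index-th commit (in, say, topological order) under a
    commits/page-NNNN/ directory so no single listing returns millions of names."""
    return f"commits/page-{index // PAGE_SIZE:04d}/{commit_id}"

def page_entries(commit_ids, page):
    """What listing commits/page-NNNN/ would return: at most PAGE_SIZE names."""
    start = page * PAGE_SIZE
    return commit_ids[start:start + PAGE_SIZE]

# Commit number 1,234,567 lands in commits/page-1234/<sha>.
print(paged_path(1_234_567, "deadbeef" * 5))
```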
Very few repositories have enough commits to be a problem, and ls is not very important either. The prompt was just that cd works. It doesn't have to show the big pile of commits.
Also you just have to use ls -f. Edit: Oh, you even mention this yourself in another comment. Serious non-problem then. If you're worried about the filesystem side, it could also load the list of commits gradually.
Fair, it'll likely break most tools written in the last 10-15 years.
> (e.g., commits) that require a lot of work to find.
That's solvable with a cache. I'm surprised git doesn't seem to have one for those, at least going by how long it takes to generate a full "shorthand log" on a large repo.
Caching is nice, but if it takes an hour when the cache is cold then it's not usable even if a directory with millions of entries posed no problems for any tools. Now for most projects that's not going to happen, but for something like the Linux kernel it just might. And there might be many caches to warm, and much to cache. For a large enough project and a computer that one might not think of as small that might just fail to work at all. This is why as Git grows up it's using Bloom filters to optimize things like git blame and git log on one file rather than adopting the fully relational model of Fossil. (Please read into this all the chagrin I'm feeling about that, because I'm quite the fan of Fossil's relational model.)
PS, the single biggest problem with million-entry directories is the propensity of tools like ls(1) to want to sort the darned things, so one has to remember to use `ls -f`, or at least to use the C locale to get memcmp() collations instead of much, much slower Unicode collations. Another problem is that the POSIX stat(2) family of functions combines reading metadata that could come from just the directory entry (e.g., a file's inode number), which has already been read, with reading metadata that requires [possibly much] extra I/O to get, so if you're doing `ls -l` on a million-entry directory you might as well go on vacation (but make sure to send the output to some file, cause your scrollback buffer just won't do).
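On the Bloom filter point above: the idea is to store, per commit, a small filter over the paths that commit touched, so a path-limited log or blame can skip commits whose filter says "definitely didn't change this file". A toy version of the idea (not git's actual commit-graph changed-path filter format, sizing, or hash scheme):

```python
import hashlib

class ChangedPathFilter:
    """Toy per-commit Bloom filter over changed paths (illustrative only)."""

    def __init__(self, changed_paths, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.field = 0
        for path in changed_paths:
            for pos in self._positions(path):
                self.field |= 1 << pos

    def _positions(self, path):
        # Derive `hashes` bit positions per path from salted SHA-1 digests.
        for salt in range(self.hashes):
            digest = hashlib.sha1(f"{salt}:{path}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def might_have_changed(self, path):
        # False means "this commit definitely did not touch `path`", so a
        # path-limited log/blame can skip diffing its trees entirely.
        # True can be a false positive and still needs the real diff.
        return all((self.field >> pos) & 1 for pos in self._positions(path))

f = ChangedPathFilter(["Makefile", "kernel/sched/core.c"])
assert f.might_have_changed("Makefile")                         # always True (it was added)
print(f.might_have_changed("drivers/gpu/drm/i915/i915_drv.c"))  # almost certainly False
```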
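And on the stat(2) point, the split between "already in the directory entry" and "needs extra I/O" looks roughly like this in Python (os.scandir exposes the directory entry; the per-entry stat() is the additional metadata lookup that makes `ls -l` crawl on POSIX systems):

```python
import os

def list_cheap(path):
    """Names and inode numbers come straight from the directory entries
    (d_ino), no per-file metadata lookup -- roughly what `ls -f -i` needs."""
    with os.scandir(path) as entries:
        return [(e.name, e.inode()) for e in entries]

def list_expensive(path):
    """Sizes (or mtimes, owners, ...) need a stat() per entry, which is
    extra I/O per file on POSIX -- this is the `ls -l` case."""
    with os.scandir(path) as entries:
        return [(e.name, e.stat(follow_symlinks=False).st_size) for e in entries]
```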
Depends on your personal preferences. I've been using cgit for this for over a decade now, it's blazing fast and I just know it inside out. It's running on a local server hosting all the repos, so obviously requires a little more setup; don't know if I'd install a web server on my machine just for that.
It should automatically map all the commits of each file to .old, .older, .old2.old and so on. For the true "version management in a file system" feeling :-)
I think this is a great idea. In the old days, before FUSE was so widely used, I saw this same idea used in the JVM, since it supports custom protocol handlers that you can use for VFS operations.
Neat! We started working on something similar, using LD_PRELOAD. After setting some env variables, we'd see a commit's files layered at some path, on top of artifacts saved for that commit. The goal is to run jobs that expect to access both git files and build artifacts, while avoiding duplication of storage. FUSE would be better, but users don't have enough permissions.
> FUSE would be better but users don't have enough permissions.
This is also going to be a great prank on the next engineer, who's never gonna figure out why he can't see any of the files the build jobs are seeing in his shell.
The patent will have expired by now, but ClearCase is exactly what I think of whenever I see these kinds of ideas come up. It's really a handy tool, but too bad almost nobody uses it or has even heard of it in the open source world, since it isn't free (in any sense of the word). They were still using it at the NRO's ground processing stations as recently as six years ago. Just rsync VOBs between dev, staging, and prod environments, check out a particular view to install upgrades, and you're guaranteed to have a totally identical environment complete with all dependencies, no containers necessary. It's really better than Git for this, too, because it can work as a distributed filesystem across many hosts at once, handles binary files perfectly well without needing extensions, and uses a real database. You can version control an entire cluster of servers the same way Git version controls a single software project.
But it's a '90s IBM enterprise business model to the core, and the rest of the Rational product suite sucks.
Yes, there’s a 20 year maximum duration for patents. Depending on some technicalities it could be a year or two longer but definitely under 26. Unless someone created additional patents with improvements.
Yes, it predates git; that doesn't matter for a patent. In ClearCase, the repository is mounted on the filesystem very much like this; on Linux, it required special kernel drivers to work back when I used it (in the 2000s).