If you cp your data onto a Plan9 machine, what results is pretty much exactly the process you've outlined.
Plan9's default filesystem is made up of two parts: Fossil, and Venti.
- Fossil is a content-addressable on-disk object store. Picture a disk "formatted as" an S3 bucket, where the keys are strictly the SHAsums of the values.
- Venti is a persistent graph database that holds what would today be called "inode metadata." It presents itself as a regular hierarchical filesystem. The "content" property of an inode simply holds a symbolic path, usually to an object in a mounted Fossil "bucket."
When you write to Venti, it writes the object to its configured Fossil bucket, then creates an inode pointing to that key in that bucket. If the key already existed in Fossil, though, Fossil just returns the write as successful immediately, and Venti gets on with creating the inode.
Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. (Microsoft attempted this with WinFS, but mysteriously failed.) Is it just inertia? Why are we still creating new filesystems (e.g. btrfs) that don't follow this design?
For my understanding: what happens if you open a file, change one byte and close it again? Since the SHAsum of the contents has changed, is the entire file now copied?
Only the block where the byte resides. A block is typically 512 to 4096 bytes. (So it's not that unlike a normal drive, where you also have to rewrite an entire sector even if just a byte changed.)
Venti doesn't know about files; it only knows about blocks of data. It's a key/value store where the key is the SHA-1 hash of a block and the value is the block (blob) of data.
The filesystem running on top of Venti asks Venti to store the new block containing the changed byte, then updates the metadata that assembles the blocks into a file.
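To make the block-level behaviour concrete, here is a minimal sketch in Python. It is a toy model, not Venti's actual protocol or on-disk format: `BlockStore`, `store_file`, and `read_file` are hypothetical names, the 4096-byte block size is an assumption, and a flat list of keys stands in for the filesystem's real metadata.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed; the thread mentions 512 to 4096 bytes


class BlockStore:
    """Toy content-addressed store: the key is the SHA-1 of the block."""

    def __init__(self):
        self.blocks = {}

    def put(self, data):
        key = hashlib.sha1(data).hexdigest()
        # Storing a block that already exists changes nothing.
        self.blocks.setdefault(key, data)
        return key

    def get(self, key):
        return self.blocks[key]


def store_file(store, content):
    """Split content into fixed-size blocks and return their keys in order."""
    return [
        store.put(content[i:i + BLOCK_SIZE])
        for i in range(0, len(content), BLOCK_SIZE)
    ]


def read_file(store, keys):
    """Reassemble a file from its ordered list of block keys."""
    return b"".join(store.get(k) for k in keys)


store = BlockStore()

# Four distinct blocks: all 0x00, all 0x01, all 0x02, all 0x03.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
keys = store_file(store, original)

# Change a single byte inside the third block and store the file again.
edited = bytearray(original)
edited[2 * BLOCK_SIZE + 10] = 0xFF
new_keys = store_file(store, bytes(edited))

changed = sum(a != b for a, b in zip(keys, new_keys))
```

After the edit, only one key in the list differs and the store holds five blocks in total: the four originals plus the one rewritten block. The old block is still there, which is what makes cheap snapshots possible in this kind of design.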
1. Plan9 has no hard links, so if you copy a Unix directory tree to a Plan9 machine you lose all the hard link info.
2. Venti doesn't use Fossil or inodes. Venti is just a content-addressable storage system, not a fileserver or filesystem.
3. Fossil is a fileserver.
> Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. Is it just inertia?
The penalty for doing content addressed filesystems is of course the CPU usage. btrfs probably has most of the benefits without the CPU cost with its copy-on-write semantics.
Note that what you describe (and my initial process) has different semantics from hard links. What you get is shared storage, but if you write to one of the files only that one gets changed, whereas with hard links both files change.
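A small Python sketch of the difference, using plain dictionaries as stand-ins. Nothing here is a real filesystem API: `objects`, `names`, `links`, and `put` are hypothetical names for illustration only.

```python
import hashlib

objects = {}  # content-addressed store: key -> immutable bytes


def put(data):
    """Store data under its SHA-1 and return the key."""
    key = hashlib.sha1(data).hexdigest()
    objects[key] = data
    return key


# Deduplicated storage: two names, each holding its own pointer to a key.
names = {"a.txt": put(b"hello"), "b.txt": put(b"hello")}
shared_before = names["a.txt"] == names["b.txt"]

# Writing to a.txt stores a new object and repoints only a.txt.
names["a.txt"] = put(b"hello world")
dedup_b = objects[names["b.txt"]]

# Hard links: two names share one mutable inode.
inode = {"data": b"hello"}
links = {"a.txt": inode, "b.txt": inode}
links["a.txt"]["data"] = b"hello world"
hardlink_b = links["b.txt"]["data"]
```

In the deduplicated case `b.txt` still reads `b"hello"` after the write; in the hard-link case it reads `b"hello world"`, because both names resolve to the same mutable object.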
In effect, hard links (of mutable files) are a declaration that certain files have the same "identity." You can't get this with plain Venti-on-Fossil, but it's a problem with Fossil (objects are immutable), not with Venti.
Venti-on-Venti-on-Fossil would work, though, since Venti just creates imaginary files that inherit their IO semantics from their underlying store, and this should apply recursively:
1. create two nodes A and B in Venti[1] that refer to one node C in Venti[2], which refers to object[x] with key x in Fossil.
2. Append to A in Venti[1], causing a write to C in Venti[2], causing a write to object[x] in Fossil, creating object[y] with key y.
3. Fossil returns y to Venti[2]; Venti[2] updates C to point to object[y] and returns C to Venti[1]; Venti[1] sees that C is unchanged and does nothing.
Now A and B both effectively point to object[y].
(Note that you don't actually have to have two Venti servers for this! There's nothing stopping you from having Venti nodes that refer to other Venti nodes within the same projected filesystem--but since you're exposing these nodes to the user, you get the "dangers" of symbolic links, where e.g. moving them breaks the things that point to them. For IO operations they have the semantics of hard links, though, instead of needing to be special-cased by filesystem-operating syscalls.)
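The three steps above can be sketched in Python as follows. This models only the hypothetical layering described in this comment, not how Venti or Fossil actually behave; `ObjectStore` and `Node` are made-up names, and a node here is simply a pointer to either an object key or another node.

```python
import hashlib


class ObjectStore:
    """Immutable content-addressed store (the bottom layer in the steps)."""

    def __init__(self):
        self.objects = {}

    def put(self, data):
        key = hashlib.sha1(data).hexdigest()
        self.objects[key] = data
        return key

    def get(self, key):
        return self.objects[key]


class Node:
    """A name that refers either to an object key or to another Node."""

    def __init__(self, store, target):
        self.store = store
        self.target = target  # a key (str) or another Node

    def read(self):
        if isinstance(self.target, Node):
            return self.target.read()
        return self.store.get(self.target)

    def append(self, data):
        if isinstance(self.target, Node):
            # Delegate the write; our own pointer stays the same.
            self.target.append(data)
        else:
            # Objects are immutable: write a new one and repoint.
            self.target = self.store.put(self.store.get(self.target) + data)


store = ObjectStore()
x = store.put(b"foo")

c = Node(store, x)  # C refers to object[x]
a = Node(store, c)  # A and B both refer to C
b = Node(store, c)

a.append(b"bar")    # step 2: the write travels A -> C -> store
y = c.target        # step 3: C now points at object[y]
```

After the append, reading through either `a` or `b` yields `b"foobar"`, while `object[x]` is still present and unchanged in the store.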
Content-addressable systems trade CPU and memory for disk space. If you expect duplication to be low, you are usually better off with a background scrubber.
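For illustration, the core of such a scrubber might look like the Python sketch below. It is only a sketch under assumptions: `find_duplicates` is a hypothetical helper, files are modelled as an in-memory dict of name to bytes, and a real scrubber would walk the disk, throttle itself, and then replace duplicates with shared storage.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group names by content hash; return groups with more than one name."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[hashlib.sha1(data).hexdigest()].append(name)
    return sorted(sorted(names) for names in by_hash.values() if len(names) > 1)


files = {"a": b"same", "b": b"same", "c": b"different"}
dupes = find_duplicates(files)
```

The hashing cost is paid in the background, once per scan, instead of on every write.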