
Deduplication falls out naturally from the content-addressable design of this object store: the address of each object is its cryptographic hash, SHA-224 in Perkeep. So if you try to put a duplicate copy, you'll find that the address you are trying to put it at is already occupied by the first copy. Perkeep assumes that you never delete anything (deletion is simply not implemented, not even for garbage collection/compaction purposes), so if you see that one copy of an object has already been put, you can treat any further puts as no-ops.
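
To make the "duplicate put is a no-op" idea concrete, here is a minimal in-memory sketch. The `BlobStore` class, its method names, and the `sha224-` address prefix are my own illustration, not Perkeep's actual API; the only thing taken from the above is that the address is the SHA-224 hash of the content.

```python
import hashlib


class BlobStore:
    """Toy in-memory content-addressed store (hypothetical, not Perkeep's API)."""

    def __init__(self):
        self._blobs = {}  # address -> bytes

    @staticmethod
    def address(data):
        # The address is derived purely from the content.
        return "sha224-" + hashlib.sha224(data).hexdigest()

    def put(self, data):
        """Store data; return (address, True if it was newly stored)."""
        ref = self.address(data)
        if ref in self._blobs:
            # Address already occupied by an identical blob: nothing to do.
            return ref, False
        self._blobs[ref] = data
        return ref, True

    def get(self, ref):
        return self._blobs[ref]

    def __len__(self):
        return len(self._blobs)


store = BlobStore()
ref1, stored1 = store.put(b"hello")
ref2, stored2 = store.put(b"hello")  # duplicate copy of the same content
print(ref1 == ref2, stored1, stored2, len(store))
```

Since nothing is ever deleted, the "already present" check is all the deduplication logic there is.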

Then there is also some logic to chunk large objects into small pieces, or "blobs". These small chunks are what the storage layer actually works with, rather than the original unlimited-length objects that the user uploaded. Chunking helps to store multiple versions of the same large file (say, a large VM image) space-efficiently - the system only needs to store the set of unique chunks, which can be much smaller than N full but slightly different copies of the same file. But I personally find that it degrades performance to the point of making Perkeep unusable for my use case of multi-TB, multi-million-file storage of immutable media files. If chunking/snapshotting/versioning is important for your use case, I'd look more towards backup-flavored tools like restic, which share many of these storage ideas with Perkeep.
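
A rough sketch of why chunking saves space across versions. This uses fixed-size chunks purely for simplicity, which is an assumption on my part and not a description of how Perkeep actually picks chunk boundaries; `CHUNK_SIZE`, `chunk_refs`, and `reassemble` are made-up names.

```python
import hashlib

CHUNK_SIZE = 4  # absurdly small, for illustration only


def chunk_refs(data, store):
    """Split data into fixed-size chunks, store each unique chunk once,
    and return the ordered list of chunk addresses describing the file."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        ref = hashlib.sha224(chunk).hexdigest()
        store.setdefault(ref, chunk)  # no-op if this chunk is already stored
        refs.append(ref)
    return refs


def reassemble(refs, store):
    """Rebuild the original bytes from the ordered chunk addresses."""
    return b"".join(store[r] for r in refs)


store = {}
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # same "file" with one chunk changed
refs_v1 = chunk_refs(v1, store)
refs_v2 = chunk_refs(v2, store)

# Two 4-chunk versions, but only 5 unique chunks end up stored.
print(len(store))
```

The flip side is visible here too: every file becomes a list of many small objects, which is where the per-object overhead I complained about comes from.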

Redundancy and backup are handled by configuring the storage layer ("blobserver") to do it. Perkeep's blobservers are composable - you can have leaf servers storing your blobs in, say, a local filesystem directory, a remote server over sftp, or an S3 bucket, and you can compose them into bigger and more powerful systems using special virtual blobserver implementations. One such virtual blobserver is https://github.com/perkeep/perkeep/blob/master/pkg/blobserve... - which takes the addresses of 2+ other blobservers and replicates your reads and writes to them.
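
The composition idea looks roughly like this. `MemoryBlobserver` and `ReplicaBlobserver` are hypothetical stand-ins and not the Perkeep implementation; in particular, having reads fall back to the next backend when one is missing the blob is my assumption about what a sensible replica would do.

```python
class MemoryBlobserver:
    """Leaf blobserver keeping blobs in a dict (stand-in for disk, sftp, S3...)."""

    def __init__(self):
        self.blobs = {}

    def put(self, ref, data):
        self.blobs.setdefault(ref, data)

    def get(self, ref):
        return self.blobs[ref]  # raises KeyError if the blob is missing


class ReplicaBlobserver:
    """Virtual blobserver: same interface, but backed by 2+ other blobservers."""

    def __init__(self, *backends):
        if len(backends) < 2:
            raise ValueError("need at least two backends to replicate")
        self.backends = backends

    def put(self, ref, data):
        # Every write goes to every backend.
        for backend in self.backends:
            backend.put(ref, data)

    def get(self, ref):
        # Assumed policy: return the blob from the first backend that has it.
        for backend in self.backends:
            try:
                return backend.get(ref)
            except KeyError:
                continue
        raise KeyError(ref)


local, remote = MemoryBlobserver(), MemoryBlobserver()
replica = ReplicaBlobserver(local, remote)
replica.put("ref-1", b"payload")

del local.blobs["ref-1"]          # simulate losing the local copy
recovered = replica.get("ref-1")  # still served from the remote backend
```

Because the replica exposes the same put/get interface as a leaf, it can itself be used as a backend of another virtual blobserver.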




Backup, as in backing up one Perkeep instance to another, is the "pk sync" command (https://github.com/perkeep/perkeep/blob/master/cmd/pk/sync.g...).

You give it the addresses of the source and destination blobservers; it enumerates the blobs in both and copies the source blobs that are missing from the destination into the destination server.
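
In sketch form the whole thing is a set difference followed by a copy. This is only an illustration of the idea with plain dicts standing in for the two blobservers; the `sync` function here is made up and is not the code behind "pk sync".

```python
def sync(src, dst):
    """Copy blobs present in src but missing from dst.

    src and dst are dicts mapping address -> bytes, standing in for the
    source and destination blobservers. Returns the addresses copied.
    """
    missing = sorted(set(src) - set(dst))  # enumerate both sides, diff them
    for ref in missing:
        dst[ref] = src[ref]
    return missing


src = {"ref-a": b"1", "ref-b": b"2", "ref-c": b"3"}
dst = {"ref-b": b"2"}

copied = sync(src, dst)
copied_again = sync(src, dst)  # nothing left to copy the second time
print(copied, copied_again)
```

Content addressing is what makes this safe: two blobs with the same address are the same bytes, so comparing address lists is enough and nothing ever needs to be overwritten.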



