For the most part, it's just an object store (think Amazon S3), but content-addressable (think Git): you put an object (file bytes) in, and you get it back out by its hash; that's it.
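
A sketch of that contract in Go (the names are mine, not Perkeep's actual interfaces):

    package blob

    // Store is roughly the entire contract of the object store: hand it
    // bytes, get back a ref derived from their hash, and fetch the bytes
    // by that ref later.
    type Store interface {
        Put(data []byte) (ref string, err error)
        Get(ref string) (data []byte, err error)
    }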

There are some bits (permanodes and claims) for adding metadata to objects (filename, timestamp, geolocation and other attributes, I think even arbitrary JSON) and for authentication/sharing. There are a few really cool bits around modularity: blob servers can be composed over the network - you can transparently split your blob storage across multiple machines, databases, and cloud services, and set up replication, maybe encryption (unclear to me whether that works or not).
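
To give a rough idea of what a claim looks like - I'm going from memory of the schema docs, so treat the field names as approximate - it's just another small JSON blob that points at a permanode and gets stored content-addressed like everything else:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // A "set-attribute" claim attaching a title to a permanode.
        // The permanode ref is a placeholder; real claims are also signed.
        claim := map[string]interface{}{
            "camliVersion": 1,
            "camliType":    "claim",
            "claimType":    "set-attribute",
            "permaNode":    "sha224-<ref-of-the-permanode>",
            "attribute":    "title",
            "value":        "Vacation photos",
        }
        out, _ := json.MarshalIndent(claim, "", "  ")
        fmt.Println(string(out))
    }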

Importing data from different services is not really its core competency, at least not yet. It can ingest anything you can put on your file system, and there are importers for a few third-party services (see https://github.com/perkeep/perkeep/tree/master/pkg/importer ), but that's about it.


Thank you so much for a description of what it actually does, which the website seems to struggle so much to convey.

One thing I'm still trying to figure out, if you happen to know: how does it handle data deduplication (if at all)? How about redundancy and backups? I've been glancing over the docs and I do see mention of replication to another Perkeep instance, but that's not quite what I'm looking for.


Deduplication is naturally handled by the content-addressable property of the object store: the address of each object is its cryptographic hash, SHA-224 in Perkeep. So if you try to put a duplicate copy, you'll find that the address you're trying to put it at is already occupied by the first copy. Perkeep assumes you never delete anything (deletion simply isn't implemented, not even for garbage collection/compaction), so once one copy of an object has been put, any further puts of it can be discarded as no-ops.
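
A toy version of that no-op behavior, with an in-memory map standing in for a real blobserver (the sha224- ref format mirrors Perkeep's, but the code is only a sketch):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // store is a toy content-addressable blob store: ref -> bytes.
    type store map[string][]byte

    // put writes data only if its content address isn't already present.
    func (s store) put(data []byte) (ref string, isNew bool) {
        ref = fmt.Sprintf("sha224-%x", sha256.Sum224(data))
        if _, ok := s[ref]; ok {
            return ref, false // duplicate: nothing to write
        }
        s[ref] = data
        return ref, true
    }

    func main() {
        s := store{}
        r1, new1 := s.put([]byte("same bytes"))
        r2, new2 := s.put([]byte("same bytes"))
        fmt.Println(r1 == r2, new1, new2) // true true false: only one copy stored
    }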

Then there is also some logic to chunk large objects into small pieces, or "blobs". These small chunks are what the storage layer actually works with, rather than the original unlimited-length objects the user uploaded. Chunking makes it space-efficient to store multiple versions of the same large file (say, a large VM image): the system only needs to store the set of unique chunks, which can be much smaller than N full but slightly different copies of the same file. But I personally find that it degrades performance to the point of being unusable for my use case, a multi-TB, multi-million-file store of immutable media files. If chunking/snapshotting/versioning is important for your use case, I'd look more towards backup-flavored tools like restic, which share many of these storage ideas with Perkeep.
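
Here's a crude sketch of why chunking saves space across versions. As far as I understand, Perkeep picks chunk boundaries with a rolling checksum rather than at fixed offsets (fixed-size chunks stop matching as soon as bytes are inserted), but fixed-size chunks keep the example short:

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // 64 KiB fixed-size chunks for the sketch; Perkeep picks boundaries
    // with a rolling checksum instead.
    const chunkSize = 64 << 10

    // chunkRefs splits data into chunks and returns their content addresses.
    // A file is then just a small blob listing these refs in order.
    func chunkRefs(data []byte) []string {
        var refs []string
        for len(data) > 0 {
            n := chunkSize
            if len(data) < n {
                n = len(data)
            }
            refs = append(refs, fmt.Sprintf("sha224-%x", sha256.Sum224(data[:n])))
            data = data[n:]
        }
        return refs
    }

    func main() {
        v1 := make([]byte, 1<<20)        // version 1 of a 1 MiB file
        v2 := append([]byte(nil), v1...) // version 2: same file...
        v2[len(v2)-1] = 1                // ...with a one-byte edit at the end

        stored := map[string]bool{}
        for _, ref := range chunkRefs(v1) {
            stored[ref] = true
        }
        reused := 0
        v2refs := chunkRefs(v2)
        for _, ref := range v2refs {
            if stored[ref] {
                reused++
            }
        }
        fmt.Printf("v2 reuses %d of %d chunks already stored for v1\n", reused, len(v2refs))
    }

With the edit at the end, storing v2 only costs one new chunk; an insertion near the start would shift every fixed-size boundary, which is the problem rolling-checksum chunking avoids.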

Redundancy and backup are handled by configuring the storage layer ("blobserver") to do it. Perkeep's blobservers are composable - you can have leaf servers storing your blobs, say, directly in a local filesystem directory, on a remote server over SFTP, or in an S3 bucket, and you can compose them into bigger and more powerful systems using special virtual blobserver implementations. One such virtual blobserver is https://github.com/perkeep/perkeep/blob/master/pkg/blobserve... - it takes the addresses of two or more other blobservers and replicates your reads and writes across them.
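
A minimal sketch of that composition idea, with two in-memory "leaf" stores behind a replicating wrapper. The real blobservers speak a network protocol and are wired up in the server config, but the shape is roughly this:

    package main

    import (
        "crypto/sha256"
        "errors"
        "fmt"
    )

    // Store is the minimal blobserver contract used below.
    type Store interface {
        Put(data []byte) (ref string, err error)
        Get(ref string) ([]byte, error)
    }

    // mem is a leaf store: an in-memory map from ref to bytes.
    type mem map[string][]byte

    func (m mem) Put(data []byte) (string, error) {
        ref := fmt.Sprintf("sha224-%x", sha256.Sum224(data))
        m[ref] = data
        return ref, nil
    }

    func (m mem) Get(ref string) ([]byte, error) {
        if b, ok := m[ref]; ok {
            return b, nil
        }
        return nil, errors.New("not found")
    }

    // replica fans writes out to every backend and serves reads from the
    // first backend that has the blob.
    type replica struct{ backends []Store }

    func (r replica) Put(data []byte) (ref string, err error) {
        for _, b := range r.backends {
            if ref, err = b.Put(data); err != nil {
                return "", err
            }
        }
        return ref, nil
    }

    func (r replica) Get(ref string) ([]byte, error) {
        for _, b := range r.backends {
            if data, err := b.Get(ref); err == nil {
                return data, nil
            }
        }
        return nil, errors.New("not found in any replica")
    }

    func main() {
        a, b := mem{}, mem{}
        combined := replica{backends: []Store{a, b}}
        ref, _ := combined.Put([]byte("hello"))
        delete(a, ref) // lose the blob from one replica
        data, err := combined.Get(ref)
        fmt.Println(string(data), err) // still readable from the other replica
    }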


Backup, as in backing up one Perkeep instance to another, is the "pk sync" command (https://github.com/perkeep/perkeep/blob/master/cmd/pk/sync.g...).

You give it the addresses of the source and destination blobservers; it enumerates the blobs in both and copies any source blobs missing from the destination into the destination server.
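
The core of that enumerate-and-copy loop, reduced to plain maps standing in for the two blobservers:

    package main

    import "fmt"

    // syncBlobs copies every blob present in src but missing from dst.
    func syncBlobs(src, dst map[string][]byte) (copied int) {
        for ref, data := range src {
            if _, ok := dst[ref]; !ok {
                dst[ref] = data
                copied++
            }
        }
        return copied
    }

    func main() {
        src := map[string][]byte{"sha224-aaa": []byte("one"), "sha224-bbb": []byte("two")}
        dst := map[string][]byte{"sha224-aaa": []byte("one")}
        fmt.Println("copied", syncBlobs(src, dst), "blob(s)") // copied 1 blob(s)
    }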