
Keep Your Stuff, for Life - vincent_s
https://perkeep.org/
======
CGamesPlay
Wow, it was really hard to find out what this software does. I finally found a
demo at the end of an hour-long talk from 2 years ago:
[https://youtu.be/PlAU_da_U4s?t=2687](https://youtu.be/PlAU_da_U4s?t=2687)

So it seems to have a bunch of scripts to import data into its database from a
variety of sources (including cloud services), and provides a search interface
to navigate that historical data. And it has a lot of under-the-hood stuff
about replication, and it's entirely self-hosted.

~~~
taneq
Does the first sentence not cover that fairly well?

> Perkeep (née Camlistore) is a set of open source formats, protocols, and
> software for modeling, storing, searching, sharing and synchronizing data in
> the post-PC era. Data may be files or objects, tweets or 5TB videos, and you
> can access it via a phone, browser or FUSE filesystem.

I checked archive.org and that text has been there for a couple of months at
least. Looks interesting; I've been in the market for this kind of self-hosted
backup/replication/tagging/search thing.

~~~
generalk
> Does the first sentence not cover that fairly well?

It does not.

The first sentence on the site ("Perkeep [...] is a set of open source
formats...") describes literally what the thing is, but not at all what it
_does_.

Not to slam on these cats, because marketing copy is _hard_. For project
collaborators, or open-source dorks who live in this kind of world anyway, the
sentence on the homepage is probably perfectly descriptive.

But I agree with my GP post. Reading the homepage I had no idea what Perkeep
actually did.

~~~
taneq
What it _does_ is:

> modeling, storing, searching, sharing and synchronizing [...] files or
> objects, tweets or 5TB videos, and you can access it via a phone, browser or
> FUSE filesystem

I mean maybe it could have been more explicit or they could have added more
detail, but having this as the first sentence is WAY better than most of the
'professional' landing pages for startups that get posted here. 'Harmonizes
synergy and increases your ability to wow your target space with your
aspirations', now _that's_ meaningless.

~~~
js2
Them: “The project description isn’t clear to me.”

You: “Well I’m sorry it wasn’t clear to you but it was clear to me and better
than these other things and here’s why it should have been clear to you.”

If someone tells you something is unclear to them, arguing about it doesn’t
change the fact that it wasn’t clear to them.

------
solarkraft
Perkeep is super interesting. I have been looking at it as a media database
(photos, movies and such, powerful tag-based search, all downloaded on
demand), but some things regarding that really hurt:

\- The data store messes with my files (yes, there's a FUSE mount, but eh,
having to adopt a special data store always makes me feel weird, since it
usually comes with performance and compatibility implications; there are also
many other block-chopping data stores, for example in IPFS).

\- The last time I checked there was no way to delete something. This is okay
for tweets, I guess, but if I commit a 3h video that I later realize is just too
large, or a photo I end up not really wanting around - well, oops.

I have huge respect for Brad Fitzpatrick in general, of course, and especially
for creating this. Recent velocity has seemingly been relatively low, however:
[https://news.ycombinator.com/item?id=22161812](https://news.ycombinator.com/item?id=22161812)

~~~
andai
> ... my time is tempered by 2.5 and 0.5 year old kids. ... I'll pick up my
> involvement again as kids get a bit older.

> We have no plans to abandon it.

------
ahupp
It sounds like this preserves your data from lots of different services. This
is something I need! But I couldn't figure out what it actually supports.
Suggestion: the very first paragraph should describe the specific inputs it
can handle.

~~~
eexan
For the most part, it's just object storage (think Amazon S3), content
addressable (think Git): you put an object (file bytes) in, and you can get it
out by its hash - that's it.

There are some bits (permanodes and claims) for adding metadata to objects
(filename, timestamp, geolocation and other attributes, I think even
arbitrary JSON) and for authentication/sharing. There are a few really cool bits
around modularity: blob servers can be composed over the network - you can
transparently split your blob storage over multiple machines, databases, and
cloud services, and set up replication and maybe encryption (unclear to me
whether it works or not).

Importing data from different services is not really its core competency, at
least not yet. It can ingest anything you can put on your file system and
there are importers for a few third-party services (see
[https://github.com/perkeep/perkeep/tree/master/pkg/importer](https://github.com/perkeep/perkeep/tree/master/pkg/importer)
), but that's about it.
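
The put/get-by-hash model above can be sketched in a few lines. This is a toy illustration of the idea, not Perkeep's actual API - the `BlobStore` class and `sha224-` prefix format are just assumptions for the sake of the example:

```python
import hashlib

class BlobStore:
    """Toy content-addressable store: blobs keyed by their own SHA-224 hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address ("blobref") is derived from the content itself.
        ref = "sha224-" + hashlib.sha224(data).hexdigest()
        self._blobs[ref] = data  # re-putting identical bytes is a no-op
        return ref

    def get(self, ref: str) -> bytes:
        return self._blobs[ref]

store = BlobStore()
ref = store.put(b"hello, perkeep")
assert store.get(ref) == b"hello, perkeep"
assert store.put(b"hello, perkeep") == ref  # same bytes, same address
```

Because the address is a pure function of the content, "does the store already have this?" is a single lookup - which is where the deduplication discussed below falls out for free.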

~~~
kemonocode
Thank you so much for a description of what it _actually_ does which the
website seems to struggle so much to convey.

One thing that I'm still trying to figure out is, if you do happen to know:
how does it handle data deduplication (if at all)? How about redundancy and
backups? I've been glancing over the docs and I do see mention of replication
to another Perkeep instance but that's not quite what I'm looking for.

~~~
eexan
Deduplication is naturally handled by the content-addressable property of this
object store: the address of each object is its cryptographic hash - SHA-224 in
Perkeep. So if you try to put in a duplicate copy, you'll find that the address
you're trying to put it at is already occupied by the first copy. Perkeep
assumes that you never delete anything (deletion simply isn't implemented, not
even for garbage collection/compaction purposes), so if you see that one copy
of an object has already been put, you can discard any further puts as no-ops.

Then there is also some logic to chunk large objects into small pieces, or
"blobs". These small chunks are what the storage layer actually works with,
rather than the original unlimited-length blobs that the user uploaded.
Chunking helps to space-efficiently store multiple versions of the same large
file (say, a large VM image) - the system only needs to store the set of unique
chunks, which can be much smaller than N full but slightly-different copies of
the same file. But I personally find that it degrades performance to the point
of being unusable for my use case: a multi-TB, multi-million-file store of
immutable media files. If chunking/snapshotting/versioning is important for
your use case, I'd look more towards backup-flavored tools like restic, which
share many of these storage ideas with Perkeep.
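
The space-saving arithmetic is easy to see in miniature. Note this sketch uses fixed-size chunks purely for illustration; Perkeep's real splitter is content-defined (rolling-checksum based), so chunk boundaries survive insertions, but the dedup accounting is the same idea:

```python
import hashlib

CHUNK = 4  # absurdly small, just for illustration; real chunks are KB-to-MB scale

def chunk_refs(data: bytes) -> list:
    """Split data into fixed-size chunks and return each chunk's content hash."""
    return [hashlib.sha224(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

v1 = b"AAAABBBBCCCC"   # first version of a file: 3 chunks
v2 = b"AAAABBBBXXXX"   # second version: only the last chunk differs

unique = set(chunk_refs(v1)) | set(chunk_refs(v2))
# Two full copies would be 6 chunks; the store only needs the 4 unique ones.
assert len(unique) == 4
```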

Redundancy and backup are handled by configuring the storage layer (the
"blobserver") to do it. Perkeep's blobservers are composable - you can have
leaf servers storing your blobs, say, directly in a local filesystem directory,
on a remote server over SFTP, or in an S3 bucket, and you can compose them
using special virtual blobserver implementations into bigger and more powerful
systems. One such virtual blobserver is
[https://github.com/perkeep/perkeep/blob/master/pkg/blobserve...](https://github.com/perkeep/perkeep/blob/master/pkg/blobserver/replica/replica.go)
\- which takes the addresses of two or more other blobservers and replicates
your reads and writes across them.
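
The composition idea fits in a few lines. This is a toy sketch using plain dicts as stand-ins for leaf blobservers; the real `replica` implementation additionally deals with partial failures, write quorums, and so on:

```python
class Replica:
    """Toy composite blobserver: fan writes out to every backend,
    serve reads from the first backend that has the blob."""

    def __init__(self, *backends):
        self.backends = backends

    def put(self, ref: str, data: bytes) -> None:
        for b in self.backends:   # every write is mirrored
            b[ref] = data

    def get(self, ref: str) -> bytes:
        for b in self.backends:   # any surviving replica can answer a read
            if ref in b:
                return b[ref]
        raise KeyError(ref)

local, cloud = {}, {}             # stand-ins for a disk directory and an S3 bucket
pair = Replica(local, cloud)
pair.put("sha224-abc", b"blob bytes")
assert local["sha224-abc"] == cloud["sha224-abc"] == b"blob bytes"
```

Because a `Replica` exposes the same put/get interface as a leaf server, composites can themselves be composed - which is what makes the design feel modular.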

~~~
eexan
Backup, as in backing up one Perkeep instance to another, is the "pk sync"
command
([https://github.com/perkeep/perkeep/blob/master/cmd/pk/sync.g...](https://github.com/perkeep/perkeep/blob/master/cmd/pk/sync.go)).

You give it the addresses of the source and destination blobservers; it
enumerates the blobs in both, and copies the source blobs missing from the
destination into the destination server.
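
That enumerate-and-copy loop is essentially a set difference over blobrefs. A toy version of the idea (not pk sync's actual wire protocol), again with dicts standing in for blobservers:

```python
def sync(src: dict, dst: dict) -> int:
    """One-way sync: copy blobs present in src but missing from dst."""
    missing = src.keys() - dst.keys()   # enumerate both sides, diff the refs
    for ref in missing:
        dst[ref] = src[ref]
    return len(missing)

src = {"ref-1": b"a", "ref-2": b"b"}    # source blobserver
dst = {"ref-1": b"a"}                   # destination already has ref-1
copied = sync(src, dst)
assert copied == 1 and dst == src
```

Note that since blobs are immutable and content-addressed, a ref match means the bytes match - there's no "newer version" case to reconcile, which is what keeps the sync logic this simple.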

------
na85
I admire the goal of reliable long term data storage. I don't need to store
tweets or 5TB videos, but my current solution is duplicate DVDs with a bunch
of par2 files to hopefully ward off bit rot.

I feel like there's some room for improvement.

~~~
denkmoon
You can buy a 2TB drive for $100. For $200 you've got 430 DVDs' worth of data
redundantly stored. For $300 you've got local redundancy AND offsite backup.

~~~
paranoidrobot
Even if magnetically they don't have 'bit rot', drives use bearings whose
lubrication can dry up and wear out when they're not spinning for long periods
of time.

You need to keep them spinning on a regular basis, and replace them as they
begin to fail.

~~~
dannyw
HDDs are also prone to silent bitrot, where the drive simply returns incorrect
bytes for a sector, even without any SMART errors. (Optical discs also bitrot,
but so do HDDs.)

This is usually a precursor to SMART errors in the near future, but
unfortunately it can still result in corrupted replication and corrupted
backups, since your backups would be backing up the rotten (corrupt) data.

I've witnessed this happen on both Seagate and WD drives, on systems with ECC
memory. I can only suspect this is due to HDD manufacturers wanting to reduce
their error rates and RMA rates: it may happen when the ECC bits in a sector
are corrupt, making the bitrot undetectable. Instead of returning an error (and
being grounds for an RMA replacement), the HDD firmware may choose to return
non-integrity-checked data, which would usually be correct but could also be
corrupt.

It's why filesystems like ZFS and btrfs are so important.

My rough estimate, based on my own experiences and those on r/DataHoarder, is
that 1 hardware sector (4KB for most drives post-2011) will _silently_ corrupt
per 10TB-year. Such corruption can be detected by checksumming filesystems
like ZFS.

Usually the whole sector is garbage, which is not indicative of cosmic-ray
bitflips.

External flash storage like USB sticks and SD cards fares far worse. In my own
experience, silent corruption occurs more like 1 file per device per 2-3 years,
irrespective of the size of the memory. I've had USB sticks and SD cards return
bogus data without errors so often. I only know because I checksum everything;
otherwise I would have thought the artefacts in my videos or photos came with
the source.

If, in 2020, you are not using ZFS or btrfs for long term archival, you are
doing something wrong.

ext4, NTFS, APFS, etc may be tried and tested, but they have no checksumming,
and that is a problem.

~~~
mnw21cam
Interestingly, on my home ZFS raidz with three 4TB hard drives, I have had to
replace a drive a couple of times because a ZFS scrub was reporting silent
corruption. They were consumer-grade SATA drives.

However, at work, I have backed up ~200TB of data to a large server with
RAID-6 and ext4, storing the backups as large .tar files with par2 checksums
and recovery data, and regularly scrubbing the par2 data. I have yet to see
any corruption whatsoever. These are enterprise-grade hard drives. This is the
strongest evidence I have yet seen that the enterprise-grade drives are
actually better than the consumer-grade ones, rather than just being re-
badged.

~~~
lostlogin
Thanks. What are the drives at your workplace?

~~~
mnw21cam
I actually have no idea. I didn't have any part in purchasing that particular
system, I don't have root, and all the drives are hidden behind a RAID
controller. Sorry.

~~~
LgWoodenBadger
How do you know they are enterprise drives then?

------
etskinner
How is this better than a plain old filesystem with bitrot compensation? What
does this have that my btrfs or ZFS filesystem (with parity) doesn't?

------
cpach
Does anyone know how mature Perkeep is? Is anyone using it regularly? Would
love to hear if there is anyone who has experience with it.

~~~
eexan
Very immature - just have a look at its largely absent documentation. The best
bit that describes the state of things: "If you're a programmer or fairly
technical, you can probably get it up and running and get some utility out of
it".

Not much to show for 7 years of development, so I'm pretty skeptical of its
future. But some of the ideas are pretty cool, like composable blob servers.

------
brudgers
Past discussion:
[https://news.ycombinator.com/item?id=18008240](https://news.ycombinator.com/item?id=18008240)

------
alexr243
There's a blockchain startup called Arweave which is trying to do exactly
this.

------
eexan
I tried perkeep a while ago. While the ideas are cool, the implementation is
meh:

I added a single 2.7GB Ubuntu ISO - it took 5 minutes to ingest (on a tmpfs!)
and turned it into 45k(!) little chunks; wtf is up with that? At this rate,
indexing my multiple terabytes of data is going to take days, and I don't even
want to think how much seek time it's going to need if I store its repo on a
spinning HDD.

~~~
shock
> I added a single 2.7GB Ubuntu ISO - it took 5 minutes to ingest (on a
> tmpfs!) and turned it into 45k(!) little chunks; wtf is up with that?

Ingest times scale linearly with file size because it needs to compute the
blobref (which is a configurable hash) for all the blobs (chunks, as you call
them). Splitting into blobs/chunks is necessary because a stated goal of the
project is to have snapshots by default when modifications are made. Doing
snapshots/versioning without chunking would be very inefficient.

~~~
eexan
Reading the docs, snapshotting/versioning doesn't strike me as a major feature
of Perkeep. It's more important and appropriate in the domain of backup
software (e.g. restic/attic/borg), where you'd want it together with delete
functionality to reclaim space.

But Perkeep's focus, as I understand it, is more on managing an unstructured
collection of immutable things (e.g. a photo archive), rather than being a tool
to back up your mutable filesystem. So I'm not sure chunking the sh*t out of my
files was a good design decision; it really kills performance on large files,
and especially on spinning disks.

------
natural219
Ah, nostalgia :).

It seems like with the recent wave of news about social media migrations
(reddit, facebook, twitter, twitch, tiktok), people are hopefully starting to
get more and more warmed up to the idea of protocolization of their social
data.

But most of the projects doing it are still just too immature. Solid, Perkeep,
Blockstack, etc. just seem like vaporware.

Seems like the only serious projects in use are Matrix, Urbit, and
ActivityPub/Mastodon. But I haven't checked in with the decentralization scene
in a while.

~~~
lukecameron
I want that protocolization too, although I don't hold out much hope that the
monopolies in place can be broken, outside of fairly radical regulation.

To add to your list, there is also Secure Scuttlebutt [1] which has had a
decent userbase over the past few years, and Planetary [2] which is a funded
iOS client for it.

I think in general they all suffer from the chicken-and-egg problem and will
need some reason for enough people to switch in order to build a userbase.
There isn't really any "novel hook" like TikTok, Twitter, WhatsApp, Instagram,
Snapchat, etc. have had in the past.

[1] [https://scuttlebutt.nz/](https://scuttlebutt.nz/)

[2] [https://planetary.social/](https://planetary.social/)

~~~
asdkhadsj
Man, I love the idea of Scuttlebutt but I hate the developer UX. I'm writing
some apps that I wanted to put on SSB but have all but given up on the idea.
Something about SSB, as a dev, leaves me with a lot of questions and no idea
where to even get answers from.

So I'll write my app outside of SSB, hopefully in a way that's mostly
compatible, and possibly with future integration.

I may also toy with an SSB-like protocol myself, as the fundamentals of SSB
are a work of art, imo. I really enjoy what gossip protocols bring to the
table, and how SSB centers P2P on human-to-human relationships.

------
darkwater
The last released version, 0.10, is from May 2018. Is this project still
alive? The last commit is from March 11, 2020, so maybe they are just "slow"
at releasing.

~~~
PurpleRamen
The last commit was just a simple bugfix in the documentation. The last juicy
commits were in December 2019, and
[https://github.com/perkeep/perkeep/graphs/code-frequency](https://github.com/perkeep/perkeep/graphs/code-frequency)
gives the impression that activity slowed down significantly in 2019. Maybe
they reached the point of good enough, but the high number of open issues and
merge requests is still problematic.

~~~
andai
They reached the point of having kids :)

[https://news.ycombinator.com/item?id=22161812](https://news.ycombinator.com/item?id=22161812)

------
0xCMP
For those curious "what does this do" and "how does this work", the
presentation on the home page (video + slides) helps a lot.

------
crooked-v
It looks like this hasn't had any updates since May 2018.

~~~
ramzeus
The GitHub page has lots of fairly recent updates:
[https://github.com/perkeep/perkeep](https://github.com/perkeep/perkeep)

~~~
vaughnegut
The last commit was in March, mind you. It looks like an interesting project,
though.

~~~
tptacek
I imagine the author is pretty busy with Tailscale right now.

~~~
Intermernet
And kids ;-)

------
mam2
In our era, it would be nicer to be able to "delete all your stuff, for life"
than to keep it.

------
heybrandons
Watching the LinuxFest talk now, this looks really neat!

------
aabbcc1241
Exporting/mirroring content from clearnet social networks to ZeroNet is also
interesting.

For popular content, you'll see it has many seeds instead of likes.

------
betimsl
How is this different from Upspin?

------
Aperocky
It's like a poor man's version of AWS Glacier.

