Perkeep: personal storage system for life (perkeep.org)
246 points by setra on Sept 17, 2018 | 62 comments

https://perkeep.org/ is the new(ish) name for Camlistore, created by Brad Fitzpatrick and with a lot of active developers.

From the home page (rather than the linked overview):

> Perkeep (née Camlistore) is a set of open source formats, protocols, and software for modeling, storing, searching, sharing and synchronizing data in the post-PC era. Data may be files or objects, tweets or 5TB videos, and you can access it via a phone, browser or FUSE filesystem.

Things Perkeep believes:

+ Your data is entirely under your control

+ Open Source

+ Paranoid about privacy, everything private by default

+ No SPOF: don't rely on any single party (including yourself)

+ Your data should be alive in 80 years, especially if you are

> Your data should be alive in 80 years, especially if you are

How do they deal with obsolescence?

Software that used to exist 50 years ago doesn't run today, and most of those formats (if they aren't text formats) are either obsolete or completely unsupported. Emulators exist, but nobody actually uses it. Part of this is because software becomes obsolete over time, and part of it is because hardware becomes obsolete.

How are they going to make software today that will run on new computers in 80 years, or how will they make software and data formats backwards compatible for 80 years?

> Emulators exist, but nobody actually uses it.

Sure, "nobody" (i.e. a negligible number of people) is running emulations of consumer software, esp. non-networked consumer software.

Networked backend server software, on the other hand, is run under emulation in production all the time. It's roughly 80% of the point of IBM's z/OS product line: to continue providing backward-compatibility with their mainframes all the way back through the early 70s, by shipping hardware that runs a hypervisor that can continue running those old workloads under (accelerated!) emulation, without changes. Any business running "a mainframe" these days isn't running on the original hardware (which has long since broken down without component replacement availability), but rather running modern hardware that's emulating their original mainframe.

I suspect that any p2p data-storage network that achieves importance and has data an archivist would care about living on it, would be given the same treatment (if people don't just consistently write new clients for it on new platforms.)

I sympathize with your skepticism but I think 1968 is so quantitatively and qualitatively different that it's not a very helpful comparison.

In 1968 nobody had personal computers, they were not a thing. ASCII is still really new, "files" aren't really a thing yet, the Multics system is under development and nobody has yet made the pun "Unix" let alone named an operating system.

What formats are you thinking of that weren't text formats but are now "obsolete or completely unsupported"? The Joint Technical Committee (home of JPEG, MPEG, and so on) isn't even an _idea_ yet; many of the people who'll form this committee are undergraduates or still in school. Machines aren't storing pictures, they're barely storing meaningful text, it's mostly numbers, big calculations.

If we ask about 40 years ago instead, things are hugely different. By this point Unix exists, ADVENT exists, ASCII has "won". There is no Internet, no X Window System yet, and there still isn't a Joint Technical Committee but already the documents, software and systems are familiar because we're still using them. At home there is Pong, and in pinball arcades the new Space Invaders, both are nicely emulated today.

> I sympathize with your skepticism but I think 1968 is so quantitatively and qualitatively different that it's not a very helpful comparison.

It's sort of like automobiles in 1968 advertising how they are made with care and detail so they'll last, and made to be easy to work on so you can expect them to actually have people (or yourself) that know how to fix them decades later. People could easily come out and say most of what made a car in 1918 was very different to then, all the way down to the tires themselves. Industries that have had multiple decades of general use mature quite a bit, and people don't like to throw away stuff that works (or that they're fond of). We'll still have computers capable of running a von Neumann architecture in 50 years, whether through hardware or software, and that's assuming we can't just port/compile to newer systems if they aren't such extreme departures.

I still occasionally play computer games written in the 1980's, generally through dosbox or something similar. I think the most likely reason we have to lose access to running this software is if we lose access to running all software, in which case nobody will really care (not that I think that's remotely likely, just that it's the most likely scenario where that holds).

They've made fairly boring choices for storing data, which makes it fairly easy to rewrite Perkeep in the future if that were to happen. However, it's likely that Go will be maintained into the future.

Perkeep's format is basically chunks of files ("blobs") named after their sha256 hash which can be reindexed as needed. So while the software needed to open the stored files could be gone, the files themselves will still be there in the worst case where the project disappears.

And metadata is stored in plain, boring JSON for the same reason.
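
To make the "chunks named after their hash, plus plain JSON" idea concrete, here is a minimal sketch in Python. It is not Perkeep's code or its actual schema: the `BlobStore` class, the `fileName`/`parts` keys, and the tiny chunk size are all made up for illustration, and the hash function is simply the one named in the comment above (the one Perkeep really uses may differ).

```python
import hashlib
import json


class BlobStore:
    """Toy content-addressed store: every blob is keyed by the hash of its bytes."""

    def __init__(self):
        self.blobs = {}  # blobref -> bytes

    def put(self, data: bytes) -> str:
        ref = "sha256-" + hashlib.sha256(data).hexdigest()
        self.blobs[ref] = data  # storing identical bytes again changes nothing
        return ref

    def get(self, ref: str) -> bytes:
        return self.blobs[ref]


def put_file(store: BlobStore, name: str, content: bytes, chunk_size: int = 4) -> str:
    """Split content into chunks, store each, then store a JSON blob listing them."""
    parts = [
        store.put(content[i:i + chunk_size])
        for i in range(0, len(content), chunk_size)
    ]
    meta = {"fileName": name, "parts": parts}  # hypothetical schema, not Perkeep's
    return store.put(json.dumps(meta, sort_keys=True).encode("utf-8"))


def get_file(store: BlobStore, meta_ref: str) -> bytes:
    """Reassemble a file by following the chunk list in its JSON metadata blob."""
    meta = json.loads(store.get(meta_ref))
    return b"".join(store.get(ref) for ref in meta["parts"])


store = BlobStore()
file_ref = put_file(store, "notes.txt", b"hello, perkeep")
restored = get_file(store, file_ref)
```

The point of the design is that nothing here needs the original software to recover the data: given the blobs, anyone can read the JSON, look up the listed hashes, and concatenate the chunks.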

> Emulators exist, but nobody actually uses it.

Disagree; I think the strongest example is DOSBox, through which DOS programs, of all things, are actually one of the least common denominators across an astonishingly wide set of platforms. Honestly, if I had to pick a format to use today that needed to hit as many platforms as possible and last as long as possible, I'd probably pick DOSBox, which is portable to Android, GNU/Linux, Darwin, NT, gaming consoles (at least Wii and Nintendo DS).... oh, Wikipedia actually has a better list: https://en.wikipedia.org/wiki/DOSBox#Ports

Anyways, I'll concede that emulators aren't as popular as native apps on most platforms, but they certainly hold their own, especially for archival purposes.

> Emulators exist, but nobody actually uses it.

They're very popular for games, and can be used to get old files out of many popular computers of the 80s.

I'd encourage you to check out this idlewords talk/blog "Web Design: The first 100 years" [0]

It presents a pretty compelling argument for why technology has a tendency to level out. I for one am confident that x86-64 binaries will still be running 50 years from now out of sheer inertia and the lack of any real practical jump in technology. (Computers of today are the 747s of 1969: good enough for almost everyone).

[0] http://idlewords.com/talks/web_design_first_100_years.htm

Heh, I am glad to hear that it's the new name... I was going to feel kinda bad pointing out that Camlistore has already been exactly this for a while.

I tried Camlistore a bit a couple of years ago and it was neat, but still pretty early. And then it looked like no development was happening on it. I would have helped, but I believe it's in Go, which I am not experienced with. Does this name change come with a new release? Is Perkeep 0.1 markedly different from the previously available public release of Camlistore?

Not only do I like this, I like their comparison page: https://perkeep.org/doc/compare

My own (slightly out of date) comparison list: https://github.com/pjc50/pjc50.github.io/blob/master/secure-...

I've recently started using Fossil[0] to archive all my personal data. It works rather brilliantly. Technically you can use any VCS but Fossil is unique in that the entire repo is a single SQLite db, so it's very easy to back up and restore. Not to mention the web UI for a quick glance before checking out any files. Even better, I can sync flawlessly between multiple hard drives and computers. I've a few separate branches for Docs/Photos etc. I check out the related branch and just add more files whenever needed. After files are added to the repo, I just remove the working copy. There are some limitations though: for example, files larger than 2GB aren't supported.
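
Because the repo is one ordinary SQLite file, a backup can go through SQLite's own online backup API instead of a raw file copy. The sketch below uses Python's standard `sqlite3` module (`Connection.backup`, Python 3.7+). The file names and the `demo` table are stand-ins invented for the example. A real Fossil repository has its own schema, which this does not try to reproduce.

```python
import os
import sqlite3
import tempfile


def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database into another file using SQLite's online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()


# Demo with a stand-in database (hypothetical names, not Fossil's schema).
workdir = tempfile.mkdtemp()
repo = os.path.join(workdir, "personal.fossil")
copy = os.path.join(workdir, "personal-backup.fossil")

conn = sqlite3.connect(repo)
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO demo (note) VALUES ('tax return 2017')")
conn.commit()
conn.close()

backup_sqlite(repo, copy)

check = sqlite3.connect(copy)
rows = check.execute("SELECT note FROM demo").fetchall()
check.close()
```

A plain `cp` of the file also works when nothing is writing to it; the backup API is the safer choice if the database might be in use.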

[0] - https://www.fossil-scm.org/

Fossil's also very easy to put online, needing at a minimum a two-line bash file to function as a CGI script.

Maybe more relevant to private data, the builtin wiki makes a good personal knowledge database.

The next version of fossil will have a forum (seen already at https://fossil-scm.org/forum/forum ). With the time sorting for threads, that might be good for temporal data that you wouldn't want to put in a wiki.

Something I didn't know is that the US Library of Congress considers SQLite databases to be long-term storage formats, on par with CSV and JSON: https://www.sqlite.org/locrsf.html

I do the same thing (store all my personal data in Fossil). I like that I can save a file in markdown format and it's rendered automatically when I view the file online. (Not unlike Github, but Github is a proprietary service.)

For my purposes, I don't see any advantage of Perkeep over Fossil. I know when I use Fossil that I can trust my system and that I will always have control of my data, and that reduces my stress levels. I have enough things to worry about without worrying about my data disappearing.

I don't use the feature very often, but Fossil supports unversioned files, which allows me to delete a 500 MB file from the repo if I no longer need it.

After your comment on file storage, and star-techate's response to you about using it as a personal knowledge database, I am surprised that I don't hear about Fossil more often.

TL;DR: This is a content-addressed data store similar to IPFS (although this project is older). You can configure one of several backends such as local file storage, S3, SSH, etc. It includes an organization system based on tags and other metadata. You can construct a FUSE filesystem representation based on a query. A web UI exists allowing exploration of existing files, uploading, etc.

I'm looking for something like Perkeep, but with the ability to add (scientific) metadata. Oftentimes when doing science for the university, your research funding comes with clauses that obligate you to store all data of your research for a timespan of 10-20 years and to do (who would have guessed) scientific research, which entails saving information with every data point: when was the data obtained, how was it obtained, who generated it, for which experiment, what's the copyright on this, is it anonymized or pseudonymised, is it connected to any other research, what's the doi/arxiv/ark-id connected to it, ...

An archive where you drag and drop your files, that can upload everything to an S3 storage (no, not Amazon S3) and attach metadata to it, would be a dream. Right now there is no good solution for this, and in the beginning I took a deep look at Camlistore and hoped for a solution in it. (I looked at Upspin, IPFS and other solutions as well.) If someone has a solution for this, or if Perkeep could be expanded (or has the option somehow hidden somewhere), I would be very happy if somebody could point me in the right direction.
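
For what it's worth, the record being asked for could be as simple as a JSON document tied to the data by its hash. The following is only a sketch of that idea: `DatasetRecord`, its field names, and every value in the example (including the placeholder DOI) are hypothetical, not an existing standard or a Perkeep feature.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    content_sha256: str       # ties the record to the exact bytes it describes
    obtained_at: str          # ISO 8601 timestamp
    method: str
    creator: str
    experiment: str
    license: str
    anonymized: bool
    related_ids: List[str] = field(default_factory=list)  # DOI, arXiv, ARK, ...


def describe(data: bytes, **fields) -> DatasetRecord:
    """Build a record whose key is the SHA-256 digest of the data itself."""
    return DatasetRecord(content_sha256=hashlib.sha256(data).hexdigest(), **fields)


measurements = b"t,voltage\n0,1.02\n1,1.07\n"
record = describe(
    measurements,
    obtained_at="2018-09-17T10:30:00Z",
    method="oscilloscope capture",
    creator="A. Researcher",
    experiment="demo-run-1",
    license="CC-BY-4.0",
    anonymized=True,
    related_ids=["doi:10.0000/example"],  # placeholder, not a real DOI
)
record_json = json.dumps(asdict(record), indent=2, sort_keys=True)
```

Since the record only refers to the data by hash, it can live next to the data in any store (S3-compatible or otherwise) and still be checked against the bytes 20 years later.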

It seems weird that deletion is prohibited. As we grow as people, sometimes we no longer want to associate something with ourselves. A photo we don't want to remember, for instance. This feels like an unnecessary restriction.

> no delete support

Yeah, that's a show stopper. There are just way too many scenarios where you need to delete something.

For instance if forced by law.

Delete is not (or only very poorly) supported in git as well. For almost all use cases this is the correct way.

Perkeep is for single users so the use case to compare with is a private git repo.

If you're not publishing anything, reverting the last change is easy and a rebase isn't that hard.

You "can't delete" but there are in fact ways to clean up deleted items and get rid of them completely.

If there are no references to the GUID of the item, you could simply consider it deleted. There would be no practical way to find it, after all. There is no "list all" or anything like that (as I understand it). You can't say 'Show me how much of the 128-bit space I've filled' and then iterate through each piece.

key words:

1. child porn

2. steganography

If I were the dev I'd add a 'list all' just to avoid anyone thinking the above were a good plan.

I don't think it's prohibited, see linked issues below. I have been keeping an eye on the project and that feature for several years. I am using zfs snapshots locally and borg backup for now. It is not comparable but does the job for now.

https://github.com/perkeep/perkeep/issues/792 https://github.com/perkeep/perkeep/issues/1076

The bottom of the page says last updated in 2013, but the name has been changed and the latest version does seem to be 0.10. This was previously called Camlistore.

Is it still the case that you can't delete anything? Although rarely needed, that seems like a showstopper these days. Irreversible actions are bad UI.

I think bolting on an equivalent of deletion is not very hard.

(1) Make sure that a chunk is unlinked from everywhere. (2) Overwrite it with random data, or plainly delete from store.

I suppose there's a way to relatively painlessly find out which chunks contain a particular item, and target them.
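
The two steps above can be sketched as a small mark-and-sweep pass over a toy blob store. This is an illustration under simplifying assumptions, not how Perkeep implements or plans to implement deletion: the store is a plain dict, metadata blobs are JSON objects with a made-up `parts` list, and any blob that doesn't parse as such an object is treated as a raw chunk.

```python
import hashlib
import json


def blobref(data: bytes) -> str:
    return "sha256-" + hashlib.sha256(data).hexdigest()


def reachable(blobs: dict, roots: set) -> set:
    """Step 1: walk from the roots and collect every blob still referenced."""
    seen = set()
    stack = list(roots)
    while stack:
        ref = stack.pop()
        if ref in seen or ref not in blobs:
            continue
        seen.add(ref)
        try:
            meta = json.loads(blobs[ref])
        except ValueError:
            continue  # not JSON: a raw data chunk that references nothing
        if isinstance(meta, dict):
            stack.extend(meta.get("parts", []))
    return seen


def sweep(blobs: dict, roots: set) -> list:
    """Step 2: delete every blob that nothing reachable points to."""
    keep = reachable(blobs, roots)
    doomed = sorted(ref for ref in blobs if ref not in keep)
    for ref in doomed:
        del blobs[ref]
    return doomed


blobs = {}


def put(data: bytes) -> str:
    ref = blobref(data)
    blobs[ref] = data
    return ref


keep_chunk = put(b"keep me")
old_chunk = put(b"embarrassing photo")
keep_file = put(json.dumps({"parts": [keep_chunk]}).encode("utf-8"))
old_file = put(json.dumps({"parts": [old_chunk]}).encode("utf-8"))

# The user has dropped old_file from their roots; only keep_file remains.
removed = sweep(blobs, roots={keep_file})
```

A chunk shared by a kept file and a deleted file survives, which is exactly why the "make sure it's unlinked from everywhere" check has to come first.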

Not to mention a violation of GDPR.

Software is not a violation of the GDPR. GDPR means you can not use it for some things, but given the focus on a personal storage system it's less relevant.

Yes, the software itself is safe. But eventually you'll want this stored online. At that point, the company hosting your data will be obligated to comply but, by design, cannot. In that sense, it's worse than simply noncompliant. It's virally noncompliant. Any software that uses it will also be affected.

The GDPR is effectively irrelevant here unless your goal is to host Perkeep as a service. Yes, if you upload your own personal database to Dropbox then Dropbox does still have GDPR obligations to you, but those obligations do not extend to managing your files for you. As an analogy, if you were to upload a zip file to Dropbox, it would not be reasonable to expect them to remove a file from within that zip file at your request.

So, you're suggesting this software would prevent a hosting provider from deleting your account data because you've used it for Perkeep data? I can't begin to comprehend the thought process that leads you there.

So you are going to sue yourself for this GDPR violation on a personal storage system?

Does the GDPR apply to single-user, standalone products where there's no company providing a service?

Of course! Transparent home phoning would be a GDPR violation for example.

Is there a user guide anywhere?

I'm having trouble finding one. The "Getting Started" page just says "run the daemon" and not much more. There are pages on how to set the many configuration options.

What if I just want to use Perkeep, or find out what the experience of using it is like? Is there a friendly walkthrough or tutorial? Or an introduction to the concepts one needs to understand as a user, not as a developer?

Looks like a pretty interesting project, and it's been consistently worked on for seven years, which is definitely something.


Anyone have a testimonial from the perspective of a user or hacker on it?

It's really worth thinking about the idea of not having filenames by default. They give a good example: if you take photos you don't want to name them, instead you want automatically collected metadata (like creation time) and some UI for easily searching by that metadata.

So it's basically a correct idea, but I want to know what is needed to make it work.

I remember the Palm Pilot tried to do this by pretending not to have files, and having "databases" instead. The result was that the palm-pilot database just became an obscure, inconvenient file format.

On the other hand, modern giant internet storage services do a pretty good job of "freeing" you from filenames, letting you get at photos, docs and so on.

On the other, other, hand, there might be something about the personal aspect of perkeep that makes it more like the palm-pilot.

'Designing better file organization around tags, not hierarchies' by Nayuki [1] (HN thread [2]) is an essay about such filename-less, tagging-based systems, covering prior art, technical background, and user experience. It's a thorough look at what has already been done, what works, what's clunky, and brainstorms about where we can still go.

[1] https://www.nayuki.io/page/designing-better-file-organizatio... [2] https://news.ycombinator.com/item?id=16763235

The reason for a filename is identity. This might be automatically assigned based on metadata (e.g. creator+date+index), but it's definitely necessary.
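
As a rough illustration of "automatically assigned based on metadata", here is a tiny sketch that hands out creator+date+index identities. The `NameAssigner` class and the name format are invented for the example; nothing here is taken from Perkeep.

```python
from collections import defaultdict


class NameAssigner:
    """Assigns creator+date+index identities; the format is a hypothetical example."""

    def __init__(self):
        self._next_index = defaultdict(int)  # (creator, date) -> last index used

    def assign(self, creator: str, date: str) -> str:
        key = (creator, date)
        self._next_index[key] += 1
        return f"{creator}-{date}-{self._next_index[key]:04d}"


names = NameAssigner()
first = names.assign("alice", "2018-09-17")
second = names.assign("alice", "2018-09-17")
other = names.assign("bob", "2018-09-17")
```

The user never types a name, but every item still has a stable, unique identity to refer to.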

Right, so to be clear by "filename" I did mean something like "filename the user actually cares about".

Almost any database (including a filesystem) has a primary key, which can be thought of as a file-name. Filesystems are unusual in that ordinary users sometimes want to explicitly deal with the records (files) and their keys (names).

There was some discussion earlier about the former Camlistore and how it differs from the Upspin project in a couple of threads here (https://news.ycombinator.com/item?id=13700492), but maybe the authors can chime in here and restate what the different use cases would be between Upspin and Perkeep. It seems like they are targeting the same audience: personal users wanting to back up data. The biggest point of emphasis is that these are not to be used for enterprises, and using them as such would be an anti-pattern, but I'm curious how the breakdown goes after that.

This was answered by bradfitz himself https://news.ycombinator.com/item?id=13700968

Where does it store the data?

Seems to be local disk or Amazon S3.


The thing about files is that they are never going away and they are simple like a rock. If you want to avoid any type of lock-in ever, just store things in files.

I think you're making a category error: files are an interface, they don't actually store anything (the underlying filesystem may or may not do that). Obvious counterexamples to "just store things in files" are /proc on Linux, pifs ( https://github.com/philipl/pifs ) and Plan9.

Note that Perkeep provides a FUSE interface, i.e. you can use files.

Being slightly less facetious, it depends on the filesystem. Files can easily disappear if, say, a disk crashes or there's a network outage.

Those problems can be avoided if we make backups and distribute copies across several disks and machines, but that gives us a synchronisation problem:

- If something gets renamed during an outage, how do we know that it was a rename rather than a brand new file?

- If we find that two nodes have different content in files with the same name/path, which one is "correct"?

- If we don't have much local storage (say, a netbook or a 'phone or a raspberrypi), how can we take part in the storage?

- How can we cache things to avoid remotely accessing the same data over and over?

- How can we keep data self-contained, i.e. without needing external metadata/keys/parity info/etc.?

These are hard problems, and Perkeep is a very promising solution to some of them.
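
To show why content addressing helps with at least the first question in that list (rename vs. brand new file), here is a small sketch that compares two snapshots by content hash. It is a generic illustration, not Perkeep's sync algorithm; the `snapshot` and `diff` helpers and the file names are made up.

```python
import hashlib


def snapshot(files: dict) -> dict:
    """Map each path to the SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def diff(old: dict, new: dict) -> dict:
    """Classify changes between two path -> digest snapshots."""
    old_by_digest = {}
    for path, digest in old.items():
        old_by_digest.setdefault(digest, []).append(path)

    result = {"renamed": [], "added": [], "modified": [], "removed": []}
    claimed = set()  # old paths already matched to a rename
    for path, digest in sorted(new.items()):
        if path in old:
            if old[path] != digest:
                result["modified"].append(path)
            continue
        # New path: identical bytes under a path that vanished means a rename.
        sources = [
            p for p in old_by_digest.get(digest, [])
            if p not in new and p not in claimed
        ]
        if sources:
            claimed.add(sources[0])
            result["renamed"].append((sources[0], path))
        else:
            result["added"].append(path)
    result["removed"] = sorted(p for p in old if p not in new and p not in claimed)
    return result


before = snapshot({"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"})
after = snapshot({"a.txt": b"alpha v2", "renamed.txt": b"beta", "d.txt": b"delta"})
changes = diff(before, after)
```

Two nodes holding different bytes under the same path show up here as "modified"; deciding which side is "correct" still needs extra information (timestamps, signatures, history), which hashing alone does not provide.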

"You are in control of your Perkeep server(s), whether you run your own copy or use a hosted version."

Can the Perkeep server be an SSH/SFTP login? Or is there a server-side component that would need to be running?

I've thought in the past about the intersection between (camlistore) and rsync.net but it's not obvious what that looks like ...

I've been looking for a system that lets me track replication of online/offline data, as well as a search tool + format obsolescence report on files. I once started writing such a thing using Python + SQLite. It's kind of trickier than it seems.

This is in the same spirit as some OSS work I did a few years back, to enable similar scenarios


This looks like an article from 2013. "Last updated 2013-06-12" is in the footer

The date on that page is old, but the source code was last updated only a couple days ago, and the last release was in May of this year.



This comment breaks the HN guidelines, which ask: "Please don't post shallow dismissals, especially of other people's work."

If you'd review https://news.ycombinator.com/newsguidelines.html and follow the rules from now on, we'd appreciate it.

Please don't be unnecessarily rude. Someone is sharing a thing that they think is valuable and is decently fleshed out. If you want to be valuable, I'd love to know why you won't try the method/project/product. That feedback could help the developers.

1. Reason?

2. It's self-hosted. If you use it, "these guys" is you. Or your leased S3 instance.

Hmmm how is this different from Box/Dropbox etc?

I don't think you can host your own Dropbox server, or run Dropbox locally.

Furthermore, Dropbox uses folder structures and can only sync folder-by-folder, and having one folder synced requires EVERYTHING in that folder to be synced.

There are many other differences that are listed in the article.
