The sad state of personal data and infrastructure (beepb00p.xyz)
289 points by karlicoss on Dec 20, 2019 | 92 comments

I've been working on a project along these lines recently. I've called it Dogsheep - the basic idea is to have scripts that export all manner of my personal data (from Google, Apple HealthKit, Twitter, LinkedIn etc) into SQLite database files, then use my Datasette web app to browse them and run interesting queries.

More about that here: https://dogsheep.github.io/

The tools I've built so far are under https://github.com/dogsheep
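For anyone wondering what "export into SQLite, then query" looks like in practice, here is a minimal sketch using only Python's standard `sqlite3` module. The `load_records` helper, the `tweets` table, and the sample records are all hypothetical and made up for illustration; the real Dogsheep tools have their own schemas and do far more.

```python
import sqlite3


def load_records(conn, table, records):
    """Store a list of flat dicts in `table`, creating the table if needed."""
    if not records:
        return 0
    # Use the union of all keys as columns (SQLite allows untyped columns).
    columns = sorted({key for record in records for key in record})
    quoted = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(record.get(c) for c in columns) for record in records]
    conn.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


# Hypothetical export data; a real exporter would fetch this from a service.
tweets = [
    {"id": 1, "created_at": "2019-12-01", "text": "hello world"},
    {"id": 2, "created_at": "2019-12-20", "text": "personal data in sqlite"},
]

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
load_records(conn, "tweets", tweets)

# Once the data is in SQLite, any ad-hoc question is just a query.
recent = conn.execute(
    "SELECT text FROM tweets WHERE created_at >= ? ORDER BY created_at",
    ("2019-12-10",),
).fetchall()
```

Pointing the connection at a file instead of `:memory:` gives you a database file that a browsing tool can open.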

I think that Perkeep https://perkeep.org/ is also worth mentioning here. One of their latest tools is the one for exporting Google Photos: https://github.com/perkeep/gphotos-cdp

Whoa, this is super exciting to me. I've been looking for something that would get some of these types of data into a web-based interface that I could package for Sandstorm. This looks like it would potentially work for that fairly easily.

I went a similar route after discovering that Google Takeout refuses to return my YT comments. I wrote a quick-and-dirty Tampermonkey script that logs submitted HTML forms (addEventListener "submit").

Did it work the way a scraper does? Curious to know how you did this.

I was only interested in stuff I posted on the internet, for indexing and easier search. https://developer.mozilla.org/en-US/docs/Web/API/HTMLFormEle... and http://www.meekostuff.net/blog/Overriding-DOM-Methods/ covered form submissions; the result goes to Tampermonkey storage (GM_setValue).

Looks cool, I'll check it out! Might be good to join forces on scripts for data retrieval/parsing.

Hey, I'm on this journey too! https://github.com/austil/datapuller

Nice. Any interesting use cases for the genome one?

There are a lot of things in this article that I agree with wholeheartedly. A few weeks ago I posted this comment https://news.ycombinator.com/item?id=21650908 which shares some of the same frustrations.

Another thing that annoys me right now (and this is a problem of my own making...) is that earlier this year I started taking notes with an app on my iPad called Notability. It works great with the Logitech Crayon stylus/pencil and is useful for jotting down notes when doing online courses etc.

Except I've shot myself in the foot a bit because those notes are now bound to the app. Yes, there is the Notability app for OSX, and yes, I should have anticipated this problem sooner, but that's beside the point: my notes are locked into the Notability ecosystem. They support this half-assed solution to export them as RTF files or PDFs, but you lose stuff like handwriting recognition.

One project on my TODO list is to see if I can reverse engineer the proprietary Notability file format, which includes the text recognition and all the things needed to render the lines that make up your notes. I know there have been attempts to do this (e.g. https://jvns.ca/blog/2018/03/31/reverse-engineering-notabili...); I just need to put the time aside to make it work.

Something I would like is to be able to add simple metadata to photos on my phone (then somehow or other I can store them sensibly on Dropbox/iCloud). The basic idea is "receipt for travel", so I can file them easily.

I know I can use Concur and ten thousand different apps, but FFS, I don't need to.

On a slightly related note: I searched “paper” and “receipt” in the latest version of iOS photos and was impressed with the results. Combined with date range and location, it should be possible to find a specific receipt fairly quickly.

What about when Apple decides to change things up or get rid of that feature? In my opinion, you don't really fully own your data until you own the software you use to interact with it.

Oh I very much agree. A similar user interface running on self-hosted data would be ideal. I just suspect to get it used by everyone the experience would have to come close to what the Photos app delivers.

Yep. Ideally it feels that this kind of thing should be solved on the file system level (e.g. https://en.wikipedia.org/wiki/Extended_file_attributes), but the current situation with the zoo of file systems on different devices/operating systems is a mess. Let alone the fact that it's not cloud sync friendly.

Evernote was good for this, by recognising text in images and by adding text and tags to notes (IIRC they have a whole app for receipts and such things).

But Evernote gradually turned to trash by neglecting basic functionality like text editing and adding bugs with each release.

And one more thing: why doesn't my browser store my history, not just the URLs of where I have been but the HTML, the images, the text etc.? Then let me search that.

Someone raised this on here a few weeks ago and it was a gobsmacking moment - my hand flew to my forehead and I realised yes - that would be so useful but the big tech firms find it more profitable to have that data on their hard drives not mine.

There is some effort into that: https://github.com/WorldBrain/Memex

You might find that (both the page and the project) useful too: https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...

Well, with ~7000 web pages saved in Evernote, I have a database of ~7000 MB and an app that barely moves. So this might be one reason why.

If you are using Firefox, you can set the history storage period to centuries and search history in the URL bar (or History page) or, when Firefox is closed, query the places.sqlite database in your profile:

sqlite3 ~/.mozilla/firefox/profile/places.sqlite 'SELECT title, url FROM moz_places WHERE url LIKE "%mozilla.com"'
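The same lookup can be sketched in Python. This is only a sketch under a few assumptions: the `search_history` helper is my own name, the profile path is a placeholder you would need to fill in, and it works on a temporary copy of the file because Firefox keeps the database locked while it is running. Copying only the main file may miss the most recent visits if they are still sitting in a write-ahead log.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def search_history(places_path, pattern, limit=20):
    """Return (title, url) rows whose URL matches a SQL LIKE pattern."""
    with tempfile.TemporaryDirectory() as tmp:
        # Query a snapshot so the live database is never touched.
        snapshot = Path(tmp) / "places.sqlite"
        shutil.copy2(places_path, snapshot)
        conn = sqlite3.connect(snapshot)
        try:
            return conn.execute(
                "SELECT title, url FROM moz_places "
                "WHERE url LIKE ? "
                "ORDER BY last_visit_date DESC LIMIT ?",
                (pattern, limit),
            ).fetchall()
        finally:
            conn.close()


# Example call (the profile directory name is a placeholder):
# rows = search_history(
#     Path.home() / ".mozilla/firefox/profile/places.sqlite", "%mozilla.com"
# )
```

Passing the pattern as a bound parameter avoids shell quoting problems when the pattern contains `%` or quotes.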

> Then let me search that.

Like Firefox's Awesome Bar?

The Awesome Bar doesn't search the content of pages you have visited.

I believe that's the exciting idea the GP is talking about.
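To make the difference concrete: searching page content needs an index over the text of each visited page, not just its URL and title. A toy sketch of that idea follows, assuming the page text has already been captured somehow. The `PageIndex` class is hypothetical and is not how Memex or ArchiveBox work internally.

```python
import re
from collections import defaultdict


class PageIndex:
    """A tiny in-memory inverted index: word -> set of URLs containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url, title, text):
        # Index every distinct word from the title and the body text.
        for word in set(re.findall(r"\w+", f"{title} {text}".lower())):
            self._postings[word].add(url)

    def search(self, query):
        """Return the URLs of pages that contain every word in the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        hits = set.intersection(
            *(self._postings.get(word, set()) for word in words)
        )
        return sorted(hits)


index = PageIndex()
index.add("https://example.org/a", "Receipts", "how to file travel receipts")
index.add("https://example.org/b", "Notes", "exporting handwritten notes")
results = index.search("travel receipts")
```

A real implementation would want ranking, stemming, and on-disk storage, which is where the database size concern above comes from.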

I had the same problem with Samsung Notes on my Samsung phone.

It's a great app, on a great device.

But I don't use it for anything serious any more because I can't get the notes out into other tools.

Quite a shame really.

apple's own notes app similarly screwed me over. they apparently changed file formats with ios/ipados 13 so that the existing pencil drawings on my ipad pro were "rasterized" during the upgrade from ios 12 (they're no longer easily editable). they also removed features like zooming in/out for no discernible reason.

edit: also meant to say that exporting the drawings only gives you raster images.

This is why I think scraping rights are so valuable.

As the article says:

> Best case scenario is if the service is local-first in the first place. However, this may be a long way ahead and there are certain technical difficulties associated with such designs.

> I'm suggesting a data mirror app, that merely runs in background on client side and continuously/regularly sucks in and synchronizes backend data to the latest state.

Here are a few premises:

1. It's a fact that only a small portion of users care heavily about centralizing and truly owning their data

2. As such, it's reasonable for companies to not focus on exporting data. That's not how they get value out of their data

3. That being said, companies should at least not punish that small group of users for taking matters into their own hands

Scraping is our solution to this problem, and the least companies can do is allow well-behaved (rate-limited) scraping.

Okay, copy-pasting over my other top-level comment as I believe that solutions other than scraping exist. In particular, scraping means that the fundamental data source is the company. It shouldn't be like that: why should the company own the data I have produced?

> The Solid Project [0] (which, AFAIK, is led by Tim Berners-Lee) is made to tackle this exact problem.

> It is about defining a standard/protocol to store personal data in a 'pod' and give minimal/granular access of my pod to web apps.

> Apps are separated to pods, and it allows the decoupling of data & functionality.

> You can switch(upgrade) from text message to instant messengers without losing any chat history for example.

> It also has an advantage that it prevents lock-in, since one can move their data around trivially.

> Looks like the OP would greatly benefit from this.

> [0] https://solidproject.org

I have thought about these issues a lot, especially lately. With regards to being able to scrape your data or just get your data back from third parties, I think that's a losing battle. You need to be in control of your data before it gets to them. Web sites and APIs are constantly changing and sometimes just disappear. This idea of polling for changes seems very brittle and would never be up to date.

What I picture is a program that you use to store your own microblogs, blogs, contacts, comments, etc. and then you publish to whoever from that app via their API or crawling.

Imagine you just created a new microblog entry. You can now post to your Twitter, Mastodon, etc. accounts with the click of a button. You would have to poll for replies though and it would be up to you to store them if you wished (you probably want to if you are storing your replies). As an added benefit you could see the replies in one place instead of bouncing between two sites.

The point is, when you create the data, it's yours first. Then if you want to, you can post it other places. Tools like this are abundant for businesses, but we don't seem to build tools for actual people anymore.
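A rough sketch of that "yours first, then post elsewhere" flow, to show the shape of it. Everything here is hypothetical: the `Entry` and `publish` names are mine, the archive is just a directory of JSON files, and the targets are plain callables standing in for whatever API client each service would actually need.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict


@dataclass
class Entry:
    text: str
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    # Maps a target name to the URL (or error note) of the syndicated copy.
    copies: Dict[str, str] = field(default_factory=dict)


def save(entry: Entry, archive_dir: Path) -> Path:
    """Write the entry to the local archive, one JSON file per entry."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    path = archive_dir / f"{entry.entry_id}.json"
    path.write_text(json.dumps(asdict(entry), indent=2), encoding="utf-8")
    return path


def publish(
    entry: Entry, archive_dir: Path, targets: Dict[str, Callable[[str], str]]
) -> Path:
    """Save locally first, then push copies; a failing target loses nothing."""
    save(entry, archive_dir)
    for name, post in targets.items():
        try:
            entry.copies[name] = post(entry.text)
        except Exception as error:  # the local original is already safe
            entry.copies[name] = f"failed: {error}"
    # Save again so the archive records where the copies ended up.
    return save(entry, archive_dir)
```

The ordering is the whole point: the local write happens before any network call, so a removed or broken API only costs you a copy, never the original.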

Sounds like you're inventing POSSE https://indieweb.org/POSSE

That's how I do it, see the original https://jeena.net/photos/524 the mastodon copy https://toot.jeena.net/@jeena/103214370709720207 and the Twitter copy https://twitter.com/jeena/status/1199954031134887936

Facebook on the other hand removed that API so I stopped crossposting there: https://github.com/snarfed/bridgy/issues/817

The problem I see with this as with most other projects linked in this thread is that they aren't actually "solutions". They're just ideas. You need programming and/or sysadmin experience to even begin using these approaches. What I'm thinking of is more of a solution that people can just start using without any friction.

> With regards to being able to scrape your data or just get your data back from third parties, I think that's a losing battle.

Both Google and Facebook have tools that allow you to export your data in both (usable) JSON and HTML.

That's fine, but how often do you export? How do you merge it with other exported data? Are you exporting your entire history every time? What about when Google/Facebook break their API or get rid of it?

That is why we are creating PrivicyPal -- to set a schedule to download and merge data from many sources and monitor over time. Will be adding beta testers soon.

See https://www.privicy.com/privicypal/about

Yep, exactly that!

As in, you can do a GDPR export or Google Takeout now and then, but you have to request it, enter your password, etc., and in a few days you will get a link to the archive. You can set a reminder to do it now and then, but your export still goes stale, and it's just so frustrating. It's almost OK as a means of backup, but it's hard to use this data in a meaningful way.
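On the merging question, the simplest approach I can think of is to treat every export as a full or partial snapshot and deduplicate by record ID. A sketch, assuming (and these are assumptions, real exports vary a lot) that each export is a JSON array of objects with a unique `id` field and that file names sort chronologically:

```python
import json
from pathlib import Path


def merge_exports(export_files, key="id"):
    """Merge JSON exports; newer files win, nothing is ever dropped."""
    merged = {}
    # Sorting by name means later (newer) exports overwrite older records.
    for path in sorted(export_files):
        records = json.loads(Path(path).read_text(encoding="utf-8"))
        for record in records:
            merged[record[key]] = record
    return list(merged.values())
```

A side effect worth noting: items deleted upstream stay in the merged result, which is probably what you want from a personal archive.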

The lack of an export feature is rather common among smaller services. What Google and Facebook are capable of doesn't scale down to other service providers.

As a counter-point (I'm playing devil's advocate because I actually agree with the author in general): at one time I found an old laptop that I hadn't used in quite some time. Excitedly I booted it up and dug around the file system to find ... nothing of interest. I thought there were treasure troves of stuff there but actually there was very little of interest at all.

It made me think about those hoarder reality TV shows and how there is a digital equivalent of that. In some ways I'm glad all of my digital data isn't amassing and following me around. I'm not sure I really care about what I posted to MySpace pages back in 2003 or usenet in 1998.

This is really a matter of ambivalence to me. On the one hand I would love detailed data about my history, even down to the exact GPS coordinates of my location every moment of my life. On the other, I am not sure this data would truly improve my life in any way. It may be that access to such data could make my life worse.

I have a location trace over the last year and I am using it for time tracking (how long was I in the office). Also I have a heatmap of all the places I have been that is really nice to look at. It has also helped me in cases where bike sharing companies said: where did you leave the bike on the 27th of September? We can't find it. Or the doctor claiming I didn't show up for the previous consultation and that I should pay for the appointment out of pocket. One look at the timestamped map and I can recall the situation.

That is really cool. Can I ask where you are storing that data? Is it something you cooked up yourself or a service you are using?

One of my back-of-the-mind ideas is to combine GPS data like this with some kind of basic activity-type id. Like "working", "practicing guitar", "socializing with friends", etc. Then I could get reports on how I spend my time.

I use the OwnTracks app with a self-written HTTP server that dumps it into Postgres. Then I use Grafana and Leaflet for visualization. Regarding your activity tracking: I already track pulse and smart watch movement too, but plan on adding more inputs like the travel mode detection on Android (walking, cycling, driving). That together with my calendar data should give sufficient data for time reports :)
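The "how long was I in the office" part boils down to something like the sketch below. It is not the setup described above, just the core idea on an in-memory list of points: the `time_near` helper, the 100 m radius, and the 15 minute gap limit are all assumptions of mine, and the gap limit exists so that a phone that was off for hours doesn't count as time spent at the place.

```python
import math
from datetime import timedelta

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def time_near(points, place, radius_m=100, max_gap=timedelta(minutes=15)):
    """Sum the time between consecutive fixes that are both near `place`.

    `points` is an iterable of (datetime, lat, lon) sorted by time and
    `place` is a (lat, lon) pair.
    """
    total = timedelta()
    previous = None
    for when, lat, lon in points:
        inside = haversine_m(lat, lon, *place) <= radius_m
        if previous is not None:
            prev_when, prev_inside = previous
            gap = when - prev_when
            # Only count intervals that start and end near the place.
            if prev_inside and inside and gap <= max_gap:
                total += gap
        previous = (when, inside)
    return total
```

With the points stored in a database, the same logic can run per day to produce a time report.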

> It made me think about those hoarder reality TV shows and how there is a digital equivalent of that.

As a third generation hoarder this is how I tried to tame it - I started hoarding data instead of things.

After a few major losses (a social network local to my country shut down without providing any option to export the data; I accidentally started formatting my backup drive right after formatting my main drive) I developed a healthier relationship with my data.

I'm still reluctant to delete things, but now I can at least bring myself to do it.

The Solid Project [0] (which, AFAIK, is led by Tim Berners-Lee) is made to tackle this exact problem.

It is about defining a standard/protocol to store personal data in a 'pod' and give minimal/granular access of my pod to web apps. Apps are separated from pods, which allows the decoupling of data & functionality. You can switch (upgrade) from text messages to instant messengers without losing any chat history, for example.

It also has an advantage that it prevents lock-in, since one can move their data around trivially.

Looks like the OP would greatly benefit from this.

[0] https://solidproject.org

Has anyone built an actual, used-in-production app on top of Solid yet?

Here are some examples: https://solidproject.org/use-solid/apps

The problem though is the same with all self-hosting: maintaining a server. It looks like you're just on your own if you want to use a "solid" app.

Or how about just letting your data vanish, except for items you've made a specific effort to save (e.g. your own original writing or musical compositions)?

Technologists seem preoccupied with creating perfect recollection, seemingly without realizing that people who have this ability innately often find it burdensome.

Given my druthers, I wouldn't retrieve my data from a service, I'd purge it.

In either case, you control the destiny of your data, either forcing it to an archive or to /dev/null.

The problem is exactly the lack of such control. Frankly, if you let the data out, there's no guarantee a copy of it does not linger somewhere.

Having companies have an incentive to get rid of your data after some time, for profit reasons, might be helpful. For instance, work email is usually purged promptly after the mandatory retention period expires, so that it could not be used in litigation.

This got me thinking about personal data storage. I think step 1 in owning our data is having a place to store it. That storage should be abstracted from the actual provider(s) so we can migrate and/or replicate our data. It should also be available from multiple devices. A personal data warehouse like this should be easy to create, a la 'deploy to heroku'.

It's shocking to me how few people I've met take their personal data storage seriously. Most folks I know treat Dropbox/Drive like a landfill.

Agreed. I feel as much as we've seen NASs grow into devices that are vaguely approachable by enthusiasts, there is still a huge amount of ground to cover before they would be the sort of thing that one could easily deploy in their parents' house.

There has been piecemeal progress in swinging the pendulum back from cloud-everything to easier to use edge computing. The Helm email server is one example. The slightly more plug-and-play approach to modern NASs is another. And there are others. But you can tell that the vast allocation of R&D is not going here yet. I do think investors will eventually wake up and realize that user demand for data control means better edge devices and avoiding reliance on the centralized cloud.

What I have envisioned for PAO [1] is federated encrypted backup. I would like to see NASs allow me to basically allocate a percentage of my capacity to various peers to store encrypted-at-rest duplicates of their data. And vice-versa. Basically a federated mesh. No need for blockchain or other crypto-hype nonsense. Just straight authenticated and encrypted file storage.

My opinion is that cloud dominance really traces back to the advent and self-reinforcing power of asymmetric Internet connectivity. When Internet connectivity was often symmetric (think the very early days of DSL), peer-to-peer networking remained very common. As the number of users on asymmetric connections increased, the asymmetry became self-reinforcing as more services centralized data and content. Peer-to-peer is effectively a relic of the past now. Only today have we started to see some resurgence of symmetric connectivity (e.g., 1Gbps symmetric fiber). I believe deployment of symmetric connectivity will be a decentralizing force as more people realize it's possible to just access your file system and data directly between devices rather than use an intermediary, and as vendors realize this is an opportunity space to offer interesting technology (e.g., the likes of Zerotier) to consumers.

[1] https://tiamat.tsotech.com/pao

> I do think investors will eventually wake up and realize that user demand for data control means better edge devices and avoiding reliance on the centralized cloud.

Hear hear! We need to not forget the core lesson of the internet, which is centralization is a weakness. I have not been happy with the increasing trend towards centralization in tech, and I agree that people taking more ownership is going to mean more edge devices.

The hardware is there, it's the software that needs to catch up.

> I would like to see NASs allow me to basically allocate a percentage of my capacity to various peers to store encrypted-at-rest duplicates of their data

This is something I've been thinking about, too. We have so many Internet-connected devices with increasingly cheap storage-- Some universal protocol for distributing data across this network would be really cool. (I understand this could sound like blockchain. I have no horse in that race.)

I would worry about the liability implications of such an approach. If one of those peers is storing (say) child porn, and some of that porn ends up on my hardware via automatic duplication, does that implicate me legally in their crime? If it does, does it matter whether it's on my hardware with or without my knowledge? If the police identify my hardware as part of that peer's storage network, is that hardware at risk of being searched and/or seized? That sort of thing.

This isn't quite the end state you're mentioning, but I've enjoyed using CloudKit (https://developer.apple.com/icloud/cloudkit/) as a developer and user in this space.

There's no setup on the user's part. For the developer, it allows you to abstract the datastore and give ownership back to the user - the developer doesn't have access to the private databases of its users. It's possible to build completely client side apps with syncing between devices without ever being exposed to the user's content.

Looks like CloudKit JS also exists for the web. I'm not sure if it allows the user to export directly, but that would help guarantee users weren't trapped.

All that said, it's tied to the Apple ecosystem. An independent service with similar features and a large enough community would be interesting.

It is definitely concerning. 10 years ago I wasn't worried about putting my data in things like Dropbox because the general fear was "what if they go out of business?" to which I felt reasonably secure that I'd have time to get my data out of there before they shut down for good.

Now, it's much more likely that you'll get a "You have violated our Terms of Service and are banned from using <the service>", pointing you to a line that looks like: "The company may, at its sole discretion, decide what constitutes a violation of the Terms of Service, and terminate services as a result." It happens!

Why are you surprised? Most people treat their closet/basement/garage/attic/trashcan like a landfill.

Additionally, depending on any small set of cloud providers makes you vulnerable to the powers that be if they ever decide that your account will be shut down without opportunity for appeal. How far up a creek would the average user be if their Dropbox, Google Drive or Amazon storage disappeared without an opportunity to fetch it first?

We need something like a geographically distributed coalition of storage that allows one to provide storage space for others in exchange for storage of your own data remotely. Then data can be replicated into multiple locations and roughly be secured in a mutually assured destruction sense (if you take down replication of my data, you lose your replication on my site).

sort of been on my mind lately because like you say, should be easy ;) snapshot of current thinking:

-some open pit data mining/management protocol exists and scrapes data out of your own personal forward proxy / metal that lives on the edge of fat bandwidth, you can do whatever you want with it, autogenerate bookmarks and forum interaction tags if you want (hadn't thought of that,) .. including not store it, because the software that stores it is part of a personally owned open source platform that is also providing all the cloud services that you normally go to third parties to obtain


-the basics are baked in, it's got your social media/self-promotional pages that are interoperable with others, an online store, search/index peers are essentially friends on social networks.. it gets foggy, how much granularity? what sort of resource commit to the forward cache? anonymization routines? regional compliance issues? capability to sell dataset(?) like, who would actually use it?

.. etc ..

-if something like this were actually to organize i think it would be best visualized as some sort of platform in support of some server-farm co-op. also i keep thinking of openstack being overkill, somehow, and am likely wrong.

-for a personal user on a watered down feature-set that isn't supporting a large organization and still elects to own their own bare metal, it would be like .. two netbooks in a post-office-box housing place that has fiber ..

if it's "your own personal forward proxy" then why would you want or enable the "capability to sell dataset"?

rather than restricting the software from commercialization by license it explicitly commercializes everything on behalf of the user ..

I'm partially being snarky. By offering opt-in backwards-compatibility with 'ye olde establishment'. Also intending to offer a path forward for businesses already relying on the business model, for users expecting products relying on the business model. I know I'm playing with a pipe-dream-sci-fi-quantum leap in the relationship between the end-user and the internet ..

... so I'm trying to be snarky and also fair and thus hopefully incentivize existing entities to implement the protocol and use it in order to take bites out of big data in manageable bits without setting everything on immediate fire. The folks who write apps with a bent on data-mining may be open to something more provider independent in order to draw in users.

also .. half-baked early adopter: .. you are a streaming content author or have an online shop and want your content to be redistributed, and are willing to make some metadata deals in order to do so. you are peered with dozens of indexes and some of them require different participation levels, maybe you have shipping partnerships, you work with some online labels or other profit-sharing outlets and this useful metadata associated with traffic that content has generated in your PO box is requested by these partners. So, the parts of your "forward-proxy-cache" that were relevant in these transactions would want appropriate taxonomy in order to facilitate ongoing partnership. I see users on the internet who like targeted ads, I know people in reality who like shopping.. I dream of a world where they all get better hobbies but I'm not trying to judge. ;)

personal forward-proxy.. a reckless way of putting it, also s'/sold/shared/' where users are hopefully suddenly tuned into the reality that once something is copied out into the public domain..

Maybe you prefer money over privacy. Not my choice but I think a lot of people would choose to sell some data for a monthly check.

Even as someone who has several terabytes sitting next to me in a homelab, I still leverage dropbox. You need something with strong integrations into other services to treat like a kind of internet RAM for short term storage.

My current approach is to have a nightly job which pulls my dropbox and other cloud storage into my local storage, but I'm planning to look into an S3 compatible service to see if that integrates well enough for my needs.

The answer to why the author of this article can’t do any of the things they list is simple. It is hard to write the software. The data is there but some program has to use it.

The author asks why he needs some start up to take his data and then trust them to do things. It’s because that start up spent a lot of money to build the software to run on their servers and now they don’t want to give it away for free.

In my opinion, this is the root cause of most of the issues with our data and control on the web in general. There was a time when the web itself disrupted the centralized networks of the day like America Online, MSN, and CompuServe. The centralized services we use today such as Facebook, Google, Amazon, Twitter and others could never have been built unless a permissionless platform like the web had existed and disrupted those centralized platforms. America Online and the others had invested a lot in the infrastructure that the web later replaced. They were extracting rents and controlling the platform.

Today we need something like that, but the infrastructure we need to disrupt was built by Facebook, Google, etc. WordPress and OpenStreetMap are just two examples.

That’s what I have been putting my money into for 8 years and open sourcing:



Take it, use it, it’s free. Build on it whatever solutions you want to manipulate whatever data you want. And perhaps more importantly, host software for entire communities of people who want to collaborate and connect with each other. Not just manage their personal data.

Feedback welcome on how to make it better. We are planning to officially launch later next year to be like a WordPress of 2020.

Hey, I just want to say your post generally resonated with me and definitely made me want to check out qbix. I've also spent years on developing software that I'm giving away for free - most of it anyway, but not the parts that would enable others to offer commercial hosting at scale. My software is "just" implementing standards (ISO and OASIS) though, so users don't have to spend their time on proprietary languages, APIs, markup, protocols, etc. With your initial statement on non-free software by start ups and your conclusion to give it all away, I thought you might be interested in this standard-oriented approach as kind of a middle ground between proprietary vs free/open SW.

Interesting to compare this to the article that came through yesterday about giving up on the semantic web, where the whole movement got a sound drubbing in the HN comments.

These are infrastructure problems, they should be treated as such, i.e., maintained by tax dollars.

The failure of the semantic web and the sad state of personal data are primarily failures of the free market to solve these problems, imho.


This doesn't seem like a failing of the free market. The free market has already created products that solved this problem, and they are some of the most popular products in history, but people are moving off of them, e.g.:

- People are moving from Microsoft Office to Google Docs

- Designers are moving from Sketch/Photoshop/Illustrator to Figma

- People are moving from Evernote/Text Files/Whatever to Notion

- People are moving from HTML files to Webflow

- People have already moved from native mail clients to Gmail

There's a problem here, but it isn't that the free market hasn't solved this problem, it's that people are choosing other features (mainly collaboration) as more important than data ownership.

Note that this is a transition mainly driven by tech people; none of these products have gone mainstream yet (excluding Gmail of course), and the products that offer data ownership are still far more popular overall. But if the mainstream follows tech people's lead, that won't be for long.

> people are choosing other features (mainly collaboration) as more important than data ownership

If it were treated like infrastructure, we could have both.

Pointing to a product that got replaced by another product doesn't inherently prove anything.

And seriously, who uses Gmail voluntarily? The complete non-existence of a suitable SMTP implementation is a pretty good example of a _clear_ failure of the market to properly allocate resources.

edit - thanks for pointing out Figma though, might have to check that out

It's almost as if the free market has an incentive mechanism to innovate better products

Pretend we didn't have computers and people wanted to store all the facts in their lives - how many people could manage a library / filing system rich enough to catalog the level of information we're expecting to keep here?

> Pretend we didn't have computers and people wanted to store all the facts in their lives

This is a pretty moot point since if we didn't have computers we wouldn't have things recording a lot of the data in that article to begin with. For example, with location, you could write down where you are every minute of the day, but that isn't very practical. Luckily, we have computers to automate that. Does that mean you should have to give that data away to a third party?

> how many people could manage a library / filing system rich enough to catalog the level of information we're expecting to keep here

Nobody could do that manually. That's the job of computers. I suppose you could keep a journal and store boxes of pictures and a lot of paper. It would be a pain, take up a lot of space, and take forever to search through. Luckily we do have computers and they happen to be really good at searching. So you should be able to just store this stuff on your computer and have it assist you with owning that data.

Instead what we have is a world where you upload everything to the cloud so someone else owns the data and you have no idea what's happening with it. They also get to choose how the data is presented to you. Since your data is spread among so many companies it's hard to get the aggregations mentioned in the article. Usually the only way that happens is companies agreeing to share your data with each other. Most people are okay with this since it's "free".

My point is, people's minds aren't caught up to the scale of data, and I don't think there is a technology solution for it. We now have more data and more technology but it's not solving the problem.

Hell I don't always know where I'm going to put all the groceries I take home.

EDIT - to add - too much data isn't really useful. When we are talking personal data collection it's basically a librarian's job, which is non-trivial

> My point is, people's minds aren't caught up to the scale of data, and I don't think there is a technology solution for it.

I totally agree with the general thrust of this argument. I'd like to hear more about use cases for this kind of personally-owned, aggregated data store. Once this article started talking about searching over, say, notes and highlights from articles and blog posts, I started to see specific use cases that seem totally compelling. However, it's not clear to me how this part of the data ownership conversation matches up with the seemingly more principles-driven data ownership conversation.

Theoretically there's nothing stopping you from building some of these more specific implementations (which the author has done--btw those projects look really cool).

> Hell I don't always know where I'm going to put all the groceries I take home

Just don't buy so many groceries. Jk. Actually, I have started working on a project that would help solve this problem (in combination with solving others). It's just nowhere near ready and won't be for a while.

> too much data isn't really useful

That's not what all the companies building huge data centers are saying.

> When we are talking personal data collection it's basically a librarian's job

I'm not really sure I understand this point. What do librarians have to do with this?

Not made for storing 'all the facts [of our] lives' - but an interesting example of a robust physical system for knowledge organization is Niklas Luhmann's Zettelkasten Method[1].

It translates well to a digital medium. The general idea is a collection of granular information (notes) interconnected in a non-hierarchical way using tags.

[1] https://www.lesswrong.com/posts/NfdHG6oHBJ8Qxc26s/the-zettel...
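
To make the "granular notes, connected non-hierarchically via tags" idea concrete, here is a minimal sketch in Python. The names (`Note`, `NoteBox`, `neighbours`) are made up for illustration and don't come from any Zettelkasten tool; it only shows that notes live in a flat collection and are reached through tags and explicit links rather than folders.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Note:
    # Hypothetical note record: one small idea per note.
    note_id: str
    text: str
    tags: set[str] = field(default_factory=set)
    links: set[str] = field(default_factory=set)  # ids of explicitly linked notes


class NoteBox:
    """Flat collection of notes; there are no folders or parent/child levels."""

    def __init__(self):
        self.notes = {}
        self.by_tag = defaultdict(set)  # tag -> ids of notes carrying it

    def add(self, note):
        self.notes[note.note_id] = note
        for tag in note.tags:
            self.by_tag[tag].add(note.note_id)

    def link(self, a, b):
        # Links are symmetric: following one from either side reaches the other.
        self.notes[a].links.add(b)
        self.notes[b].links.add(a)

    def with_tag(self, tag):
        return sorted(self.by_tag.get(tag, set()))

    def neighbours(self, note_id):
        # Everything reachable in one hop, via a shared tag or an explicit link.
        note = self.notes[note_id]
        related = set(note.links)
        for tag in note.tags:
            related |= self.by_tag[tag]
        related.discard(note_id)
        return sorted(related)
```

Browsing then means starting from any note and hopping to its neighbours, which is roughly what following references between index cards does in the physical version.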

TiddlyWiki is the closest digital version of that philosophy.

For those who could, journaling (or even paying someone to chronicle you) was a regular activity that consumed significant time. Especially at a time when everything went by mail, so you wanted to send denser letters, less often, to your circle of friends.

I believe the data quality of such a library could be improved incrementally through enrichment, labeling, ETL, etc. Treat it like a garden.

What does ETL look like without computers?

Are there any other interesting articles on good use cases for this kind of centralized data ownership? The syndication concept and the discussion about data rights aside, why else might I want to have all the data I ever create or interact with in aggregate like this? The author has referenced a few really interesting projects they worked on, curious if there are more good ones.

I'd be interested to know too. One reason I started writing this all up is to harvest more of other people's setups, tools and workarounds -- I'll be sure to link them later.

Sounds good, thanks. Cool blog btw, I poked around for a while. Nice work!

Because you own an AI with a strongly developed logic module.

> (typically) the more something machine friendly the less it's human friendly.

I believe I have solved some of that problem. Hode (the Higher Order Data Editor), a kind of generalization of graph databases, lets a user enter data in a manner which I believe is as similar as possible to natural language. To encode the fact "cats kill birds" as a "kill" relationship between "cats" and "birds", you just write "cats #kill birds". Relationships involving other relationships can be represented with similar ease, and relationships involving any number of elements. The query language is not much more complicated -- for instance, everything that kills birds would be "/e /it #kills birds".


This worries me too - a lot. So much so that I'm working on a video series documenting the creation of a personal photo & video storage system from scratch. My belief is that one way to extricate our data from big tech's catacombs is to spread more widely the skills needed to build out personal data stores.

I'm starting with photos & videos because my own collection is trapped in Flickr, and it may not be long for this world. I intend to build out such a system and publish instructional videos on how to do it as I go along.

Imagine a world in which more or less everyone can do this. It's not such a long shot, don't you think?

I can’t help but think that this glut of resources we are burning on deep library/platform stacks and spying on users might be better spent on attaching more metadata to all information flowing through our systems. Some provenance might be good.

I’ve seen a bit of how Boeing tracks parts, and doing something simpler but similar for data might be tractable now, except that I think it would take APIs that work substantially differently from conventional code. Except perhaps in Ruby and Node (thinking in particular about htmlsafe tagging in Rails).
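
A rough sketch of what "values that carry their provenance" could look like, in Python. This is only an illustration of the tagging idea: `Tagged`, `tag` and `combine` are hypothetical names, not an existing library and not how Rails implements its HTML-safety flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    # A value plus the names of the sources it was derived from.
    value: str
    sources: tuple[str, ...]


def tag(value, source):
    """Wrap a raw value with the single source it came from."""
    return Tagged(value, (source,))


def combine(a, b, sep=" "):
    """Join two tagged values; the result remembers both sets of sources."""
    merged = a.sources + tuple(s for s in b.sources if s not in a.sources)
    return Tagged(a.value + sep + b.value, merged)
```

The catch is the one already mentioned: every operation that touches the data has to go through something like `combine`, otherwise the metadata is silently dropped.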

I actually do share many of the author's thoughts on how maddening it can be to collect specific pieces of data that seemingly should be collectable but aren't because of some stupid reasons xyz. To play devil's advocate, though - Is it really malice, or just that people rarely think about these particular use cases in the first place?

It seems like a lot of the things and products mentioned here, if released by an independent dev or small team, could similarly be overlooked. I can't imagine most of the engineers I know (and I suppose especially not rich megacorps) ever really considering the .01% of people (the kind of demographic you'd find on HN, I guess) saving ALL browsing history across browsers according to some universal standard or LinkedIn statistics or YouTube text history.

OT-ish: I can see how this would be relevant for most people living well enough, say, like middle-upper class America, who can afford these technologies and to care about the multitude of examples presented, but is there a conversation about how much data we should or need to be collecting (to say nothing of handing off to 3rd parties) at all in the first place?

I've always felt that relying on and interacting with less technology (or at least making efforts within reason to) was better for my own quality of life (less tracking, less worrying about posting regrettable stuff, sticking to basic principles like "move more, eat less" instead of obsessively counting stuff on my old MyFitnessPal and Fitbit) - surely I'm not alone?

Great to hear we resonate!

This is a valid point, I kind of admit these are sort of first-world problems. But my main motivation for raising these issues in the first place is to learn better, process information more efficiently, have better memory, and this is something I wish to use to work, learn and reason about things that really matter, like climate change or solving poverty, etc.

Regarding tracking less -- I guess people are different, I do know people who are happy to just stick to 'move more, eat less'. For me personally such maintenance is boring, and looking at stuff like workout/sleep data etc really motivates me to learn more about it and keep going. I really hate going for another run but at least I'll have a datapoint after it!

I feel like the actually stressful bit is having to think about tracking. If it was done automatically and you didn't have to think about it, then why not? You'd always find people who are obsessed about doing (or not doing) things even without counting I guess.

Isn't, overwhelmingly, the problem the economic model all software is made under? Any attempt to come up with a solution to this that ignores this fact will never treat the root cause of the issue.

Maybe! But I certainly don't feel I have the capacity to challenge the current economic model :)

One reason this 'sad state' exists is because content providers (for example websites, apps, etc) need to generate revenue and the most prevalent method is advertising.

Micropayments would go a long way toward shifting providers away from advertising and towards pay-per-use. Many users would not object to paying a fraction of a cent for reading an article - especially if it would enable the provider to remove the tracking and invasion of privacy all in order to make a buck.

The reason I started / made cloudcmd [0] back around 2009 is that I saw this coming. My goal was to build a decentralized storage system with search capabilities and smart enough that you could rebalance storage based on cost and convenience.

Is now the time to restart this?

[0] https://github.com/briangu/cloudcmd

I've been using Tiller for a few months:


Totally happy to pay for someone to maintain connections between bank APIs and a Google Spreadsheet. Curious how long they'll last.

I agree it's sad, but there's an incentive to keep your data captive. This problem isn't really a technical issue.

Wonderful post!

> Why can't I search across watched youtube videos even though most of them have subtitles hence allow for full text search?

This blows my mind every time I'm on youtube... so much potential, and yet.
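
For a sense of how little is needed once you have the subtitle text locally, here is a minimal sketch of a word-level search in Python. It assumes you have already exported subtitles into a dict of video id to text (getting them out is the hard part and isn't shown); `build_index` and `search` are made-up names, and a real setup would more likely lean on a database's full-text search.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(subtitles):
    """Map each word to the set of video ids whose subtitles contain it.

    `subtitles` is assumed to be a dict of video id -> subtitle text.
    """
    index = defaultdict(set)
    for video_id, text in subtitles.items():
        for word in tokenize(text):
            index[word].add(video_id)
    return index


def search(index, query):
    """Return ids of videos whose subtitles contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)
```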

> Often, a friend recommends you a book so you want it to add to your reading list.

Yep. For a while I was collecting them in a spreadsheet. After about two years I've realized it's actually a lot more about the context of why/where/when I added a book rather than it simply existing in a long list. Even though I had "Source" and "Date Added" as columns, I (still) have no way of grouping them by topic cross-referenced with my notes.

Also, the conversation in which I received the recommendation likely has valuable context I haven't included, and good luck deep-linking to a message. (Telegram handles this OK, but Gmail? Or god forbid iMessage).

> Why can't I see what was my heart rate (i.e. excitement) and speed side by side with the video I recorded on GoPro while skiing?

Another angle I've considered: the past four (text/email/GPS) interactions with (some person) have resulted in higher stress levels ... this is an insight I typically extract from writing about my day. Would be interesting to have it suggested to me. Yes, lots of privacy implications here.

> It's just a matter of regularly fetching new stories/comments by a person and showing new items, right?

Not sure if you've seen fraidycat [0] and the discussion [1]. Basically, a fetch-and-consume model for blogs, Twitter, etc with frequency/priority levels.
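
The "show only new items" half of that really is small. A minimal sketch in Python, assuming each fetched item is a dict with a stable `"id"` key (an assumption for illustration, not fraidycat's actual data model) and leaving out the fetching itself:

```python
def new_items(fetched, seen_ids):
    """Return items not seen before, and remember them in `seen_ids`.

    `fetched` is a list of dicts with an "id" key; `seen_ids` is a set
    that the caller keeps (and would persist) between runs.
    """
    fresh = []
    for item in fetched:
        if item["id"] not in seen_ids:
            seen_ids.add(item["id"])
            fresh.append(item)
    return fresh
```

The fiddly parts are everything around it: getting the items out of each site on a schedule, and keeping `seen_ids` somewhere durable.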

> Why am I forced to manually copy transactions from different banking apps into a spreadsheet?

Plaid [2] looks promising, but I haven't built anything noteworthy with it yet.

> Why can't I easily share my web or book highlights with a friend?

This is ridiculously hard. I think my favorite solution to date is to copy-paste the whole article into a Google Drive doc and annotate it. Not a good solution, I know.

> I wonder what computing pioneers like Douglas Engelbart (e.g. see Augmenting Human Intellect) or Alan Kay thought/think about it and if they'd share my disappointment.

I imagine they would be/are very upset.

0 - https://github.com/kickscondor/fraidycat/issues

1 - https://news.ycombinator.com/item?id=21802952

2 - https://plaid.com/

Just a quick grammatical note:

> Monzo API only allows to fetch all of your transactions within 5 minutes of authentication.

The referenced link states:

> After a user has authenticated, your client can fetch all of their transactions, and after 5 minutes, it can only sync the last 90 days of transactions. If you need the user’s entire transaction history, you should consider fetching and storing it right after authentication.

So I would think the sentence would be:

"Monzo API only allows to fetch the last 90 days of your transactions after 5 minutes of authentication."

...which actually seems worse.

I don't understand. The two sentences both seem to match the referenced link -- the first says that it is only possible to fetch all transactions in the first five minutes, while the second says that after five minutes, you can only fetch the last 90 days of transactions (so, can't fetch all). Are you saying it is worse that you can still fetch transactions after five minutes?
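
Spelling the quoted rule out as code may help settle it. This is a sketch of the rule as described in the quote above, not of the Monzo API itself; the function name and the choice to treat exactly five minutes as still inside the window are my assumptions.

```python
from datetime import datetime, timedelta

FULL_HISTORY_WINDOW = timedelta(minutes=5)   # from the quoted docs
RECENT_HISTORY_LIMIT = timedelta(days=90)    # from the quoted docs


def earliest_fetchable(authenticated_at, now):
    """Return the oldest transaction time a client may request.

    None means there is no lower bound (full history is available).
    """
    if now - authenticated_at <= FULL_HISTORY_WINDOW:
        return None
    return now - RECENT_HISTORY_LIMIT
```

So both sentences describe the same behaviour from different sides: inside the window you get everything, outside it you still get transactions, just capped at the last 90 days.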
