If you’re interested in this ‘Personal Data Warehouse’ concept, you might also be interested in @karlicoss’s articles[0][1][2] about his infrastructure for saving personal data.
There are various technical differences, for example @simonw prefers lightweight SQLite databases while @karlicoss prefers dumping the raw data and parsing it when needed[3], but the purposes are similar enough that I think they are worth mentioning.
There were also some very constructive HN discussions[4] on these articles, where @simonw has also introduced Dogsheep before :-)
I'm building something like that, although my goal is to create a certain UI, not to warehouse my own data. I want the usual photos timeline, but extended to include other artefacts of my daily life. The goal is to roll back to a point in time and find my photos, conversations, transactions, actions and location history.
Working on my own files is easy: pull files incrementally to a central place with rsync, then process the changed files. Use their checksums to create previews without duplicates.
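That checksum step might look like this in sketch form (paths, helper names, and the preview format are placeholders, not the actual tool):

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path) -> str:
    """SHA-256 of a file's contents, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def preview_path(source: Path, preview_dir: Path) -> Path:
    """Name previews after the content hash, so two copies of the same
    file share one preview and a rename never regenerates anything."""
    return preview_dir / f"{file_checksum(source)}.jpg"
```

The rsync pull itself happens separately (e.g. `rsync -a` into the central directory); this only covers the dedup-by-content idea.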
Working on my own websites is also easy. I just need to add an RSS feed.
Unfortunately, fetching data from other sources is much harder. It made me realise how much of my data is held hostage. Most of it can be retrieved manually, but not with a script that runs regularly. Instant messaging and location history are two big examples.
Repeatability is another problem. For example, Reddit only lets you access your last 1000 comments.
And at last, you must deal with updates. If you go back and revise a comment or a post, the data should be updated.
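One way to handle that: key each record on the source's own ID and upsert on re-import, so revised comments overwrite the stale row instead of duplicating it. A minimal sketch with made-up table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (
        id TEXT PRIMARY KEY,   -- the source's own ID, e.g. a Reddit comment ID
        body TEXT,
        fetched_at TEXT
    )
""")

def upsert_comment(conn, comment):
    # Re-running the import updates edited rows instead of inserting duplicates.
    conn.execute(
        """INSERT INTO comments (id, body, fetched_at)
           VALUES (:id, :body, :fetched_at)
           ON CONFLICT(id) DO UPDATE SET
               body = excluded.body,
               fetched_at = excluded.fetched_at""",
        comment,
    )

upsert_comment(conn, {"id": "c1", "body": "first draft", "fetched_at": "2020-11-14"})
upsert_comment(conn, {"id": "c1", "body": "revised", "fetched_at": "2020-11-15"})
```

(`ON CONFLICT ... DO UPDATE` needs SQLite 3.24+, which any recent Python ships with.)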
For IM specifically, one thing you could do (if the IM platforms you want are supported) is to set up a Synapse homeserver with bridges; then you have everything (encrypted or unencrypted depending on config) in a SQL database. It may not be worth the overhead, depending on your hassle tolerance, if you're not interested in using matrix.org otherwise (I recommend it though).
At least a lot of the annoying groundwork is already done. If you do hit some snags, bug reports and PRs are generally very much appreciated in the bridges projects.
Have you tried Tailscale? It uses WireGuard to set up a secure mesh network between your devices that punches through NAT, and it is incredibly easy to set up.
That's really cool. I'm also in this space and have been thinking about this recently, with no answers.
Tailscale seems like a great option but i'm mildly concerned about relying on a company for this. Plus the solo plan with a family option sounds a little.. meh.
Definitely an interesting take on this problem, appreciated :)
My idea is to use webrtc very cleverly. You could make money as a default handshake server. It's an interesting model for disruption because you don't have to worry about legal exposure to the rapidly eroding safe harbor landscape.
You could always host the files elsewhere. I don't do that because this is first and foremost a file backup tool (rsync-based). Otherwise I'd put it on a cheap DigitalOcean VPS.
It's great that you're building an interface for that, please keep us updated!
I think it's great if we solve complementary problems (i.e. I've been heavily on the 'liberate and access data' so far), and plug into each other's solutions.
It's at github.com/nicbou/backups. It's already live on my home server, but not really built to be distributed. I build some software like meals: to be enjoyed by myself and a few guests.
Thanks! Are there screenshots or something like that?
"not really built to be distributed" -- completely understandable... sharing has been much harder than I imagined! For myself I'd probably be better off with some huge monorepo.
Late to the thread, but there's an open source app called fluxtream (no S) that was designed as a personal data logger and aggregator. It might be useful for inspiration. I haven't gotten around to trying it, and don't know how well maintained it is now.
> Most of it can be retrieved manually, but not with a script that runs regularly
This may be down to the ability of the script or its author. Most things retrievable manually are retrievable by scripts, bots, scrapers, etc.
The bigger data captors also provide APIs, which can help to a degree, and where limits are imposed there are usually workarounds.
Sure, but it's hard to build something reliable and long-lasting that relies on scraping websites that actively try to prevent scrapers. It's not that I can't build it, but rather that I won't.
Thanks for sharing, @simonw and I seem to be bumping into each other regularly :)
I've been meaning to give Datasette a try and plug it into my system!
Even though I prefer to rely on code as the main interface, in most cases I already have SQLite for free, because it's used as a cache [0]. If a function is marked with the @cachew decorator, its results are cached on disk, invalidated when the arguments change, etc.
Only had to adjust the query in the last step to conform to the naming (i.e. `select _cachew_union_repr_Photo_geo_lat as latitude, _cachew_union_repr_Photo_geo_lon as longitude`)
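The core idea behind cachew can be sketched in a few lines. This is a toy stand-in, not the real cachew API (which also uses type annotations to build the schema and handles richer invalidation):

```python
import json
import sqlite3
from functools import wraps

def sqlite_cached(db_path=":memory:"):
    """Toy take on the cachew idea: memoise a function's rows in a
    SQLite table, keyed by the function name plus its arguments."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, rows TEXT)")

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args):
            key = fn.__name__ + json.dumps(args)
            hit = conn.execute("SELECT rows FROM cache WHERE key = ?", (key,)).fetchone()
            if hit:
                return json.loads(hit[0])   # cache hit: no recomputation
            rows = list(fn(*args))
            conn.execute("INSERT INTO cache VALUES (?, ?)", (key, json.dumps(rows)))
            return rows
        return wrapper
    return decorator

calls = []   # track how many times the real function runs

@sqlite_cached()
def photos(year):
    calls.append(year)
    return [[year, "IMG_001.jpg"]]
```

Calling `photos(2020)` twice only executes the body once; a different argument is a different cache key.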
The other day, I was considering a mix of these: DDL/DML as a serialization/synchronization format, which gets processed locally to form a SQLite db. Then use RAD tools to craft a custom UI for your needs.
This is a video and annotated summary of a talk I gave for the GitHub OCTO Speaker Series yesterday, describing how I've been building my own personal data warehouse on top of my Datasette and Dogsheep open source projects.
I feel like every nerd eventually tries to build something like this. My main contribution here is the observation that importing data from personal sources into SQLite massively reduces the amount of work needed to run further analysis, especially with a web interface that lets you run and bookmark queries.
SQLite can even join data together from multiple different database files, though the Datasette interface doesn't support cross database joins just yet.
Yes, you can load data from different sources into the same database - each of the Dogsheep tools lets you pick the database file that you are importing data into.
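For reference, the cross-database part relies on SQLite's ATTACH; a sketch with made-up file and table names for how querying two Dogsheep-style database files at once might look:

```python
import os
import sqlite3
import tempfile

# Two separate database files, as per-source import tools would produce.
tmp = tempfile.mkdtemp()
twitter_db = os.path.join(tmp, "twitter.db")
swarm_db = os.path.join(tmp, "swarm.db")

with sqlite3.connect(twitter_db) as db:
    db.execute("CREATE TABLE tweets (day TEXT, text TEXT)")
    db.execute("INSERT INTO tweets VALUES ('2020-11-14', 'gave a talk')")

with sqlite3.connect(swarm_db) as db:
    db.execute("CREATE TABLE checkins (day TEXT, venue TEXT)")
    db.execute("INSERT INTO checkins VALUES ('2020-11-14', 'GitHub HQ')")

# ATTACH lets one connection join across both files.
conn = sqlite3.connect(twitter_db)
conn.execute("ATTACH DATABASE ? AS swarm", (swarm_db,))
rows = conn.execute("""
    SELECT t.text, c.venue
    FROM tweets t JOIN swarm.checkins c ON t.day = c.day
""").fetchall()
```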
Hey thanks! I'm interested in building this over the next year or so, and I was thinking of using SQLite for exactly these reasons, thanks for confirming my suspicions
I'd like to encourage you to consider doing something for people too. Even if it's in the 'cloud', maybe something like Telegram/Signal, where each device has to 'pair' with your account and, ultimately, the user has the keys.
I don't have the time to watch your presentation but I skimmed your slides. Wolfram is legit. I didn't know that about him and I plan on reading that article about his productivity.
Also, I didn't realize that others were actually interested in this. I've been trying to do this type of stuff for....about 4 years now. I'm not super tech savvy and I also keep getting busier. I really want this for myself. I have a ton of webpages that I want to put into a system (dated on when I enter it, tags, etc). I currently salvaged my entire email system, even after a lot of headache (years....worth). I have Todoist for tasks. I have Pocket for bookmarks. I want a place for all my notes, tidbits, all my digital content to be put into one place.
Even recently, I was thinking, I could get an ANN/ML that would see what I'm 'saving' and doing and it could suggest higher value ideas/content for me to consider. (Like a YouTube recommendation engine but...instead of trying to get you to waste your time on something, the ANN is focused on trying to give you a ROI for what you've come across; becoming a better father, as ME, as an individual..., or taking my first step to learning something that I didn't realize that I wanted to know [accounting is one strange topic that has arisen in my life], etc, etc). I would like to eventually get a team and hire a few personal assistants but...ultimately my game plan is what I wrote above. A decent algorithm that has a solid feedback loop based on my experience through life, to help me become more productive on what my goals and thought processes are.
I say this all to you, hoping that maybe you or someone in HN could build this. I'd probably be willing to go up to $1k a year or something. Until I generate a lot more money for myself, through my own brute force, I won't have the cash flow to get someone to help me build this myself. I wouldn't want to sell it. I'd want it for myself because there is a 'good enough' threshold that I think would be incredibly valuable to me. (Sure, you could keep iterating...but I suspect that after the threshold is reached, it would be a certain limit of diminishing returns until some major algo breakthroughs happen)
P.S. I do understand that you want to target businesses, etc. That's where the money is, sure. But...I wish people could finally get a solid product for once. What's that email service that reduces everyone's clutter that people are paying significantly for? Yeah. Be that, but for the personal data warehouse. I'd pay the premium.
Dogsheep Beta is a pun on Wolfram Alpha [1] (dog/sheep/beta are humble alternatives to wolf/ram/alpha). Both are queryable knowledge bases.
Dogsheep Beta leverages SQLite FTS (full-text search), adding a search interface on top of Datasette’s SQL, accessed via a Python-powered web UI. All self-hosted.
The “personal” part leverages Datasette plug-ins for services that export user data (e.g. Twitter, Hacker News, GitHub, Pocket, Evernote, and Apple HealthKit).
More of a poor man’s Elasticsearch for SQLite/Datasette than a data warehouse, in my opinion. Simon Willison is on a roll delivering simple yet powerful tools.
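The FTS part is plain SQLite. A minimal sketch, assuming an SQLite build with FTS5 (standard in recent Python) and a made-up schema far simpler than Dogsheep Beta's actual one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An FTS5 virtual table indexes its text columns for full-text search.
conn.execute("CREATE VIRTUAL TABLE search USING fts5(source, title, body)")
conn.executemany(
    "INSERT INTO search VALUES (?, ?, ?)",
    [
        ("twitter", "tweet", "gave a talk about personal data warehouses"),
        ("github", "commit", "fix cross-database join support"),
    ],
)
# MATCH runs the full-text query; rank orders results by relevance.
hits = conn.execute(
    "SELECT source, title FROM search WHERE search MATCH ? ORDER BY rank",
    ("warehouses",),
).fetchall()
```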
Love these topics. I've been obsessed with this subject over the last ~4 years since i heard about Camlistore (Perkeep).
Oddly, in the same time span my love for content addressable stores has only grown - but my desire for "retaining" any online data has shrunk to near nothing. With the exception of Spotify, i avoid not only giving data but leaving a footprint on the cloud. I know it's happening, but i rotate accounts frequently, delete old posts, etc - and try to scrub with limited effort.
This hasn't affected my motivation for a data warehouse; but it has limited my desire to spend any effort on keeping things in sync. Which is good for me: having watched the effort Camlistore went through to constantly scrape and maintain syncing behavior from Facebook, Twitter, etc., i'm glad to avoid it.
These days i mostly focus on writing a Camlistore with some of my own preferred interfaces. Notably "normal SQL database". I didn't enjoy Camlistore's app interface, i just want the feeling of using SQL. But i also love content addressable stores so that's where i sit.
My efforts currently go towards merging concepts between Camlistore and Noms DB, with SQL- and Git-like interfaces. A large problem for side projects, but plenty to keep me interested.
I believe all data warehouses are limited by the quality of their data model. Most start with good relational intentions over a small domain, but eventually get bogged down arguing how semantic angels might dance on ontological pins. The parts that work become ossified and impossible to change. The system starts to fragment into multiple federated datastores or unstructured file dumps (“big data!”) where you have to build your own integration every time you want to use the data. Someone comes along and proposes a unifying model (“everything is an event!”) and rebuilds the whole thing but with an extra layer of complexity. Someone suggests buying an industry data model instead - surely the data experts will have solved all these problems for us? A skunkworks project spins up and starts implementing the bought model with good relational intentions over a small domain...
I don’t think personal data warehouses are immune to any of these forces.
My implementation deliberately avoids any hint of a unified data model: each source of data (Twitter, Swarm, GitHub etc) gets a database schema that matches as closely as possible to the original JSON. Then you figure out and bookmark basic SQL queries for each one.
If a source's data model changes, stuff will break and you will need to fix it. I'm fine with that.
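A toy sketch of that schema-per-source approach (a stand-in for what sqlite-utils-style tools do for the Dogsheep importers; table and keys here are made up):

```python
import json
import sqlite3

def insert_json_rows(conn, table, rows):
    """Create a table whose columns mirror the source JSON's keys,
    then insert the rows: one schema per source, no unified model.
    Nested objects are stored as JSON strings."""
    cols = list(rows[0])
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})",
        [
            [json.dumps(r[c]) if isinstance(r[c], (dict, list)) else r[c] for c in cols]
            for r in rows
        ],
    )

conn = sqlite3.connect(":memory:")
tweets = [{"id": 1, "full_text": "hello", "entities": {"hashtags": []}}]
insert_json_rows(conn, "tweets", tweets)
```

The resulting table stays close to the original API payload, so queries break only when the source itself changes, which is the trade-off described above.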
>I believe all data warehouses are limited by the quality of their data model.
This is common, and most organizations overcome it with higher-effort work downstream by Analysts & BI. I personally prefer 3NF tables, event streams, or Kimball, in that order, as ways to build a better DW. The reality is that ugly data modeling can be overcome by Analysts & BI tools, and that seems to be the norm these days: use a DW to make data accessible, then let Analysts figure out the metrics. With the recent emphasis on tools like Looker or Tableau, this pattern seems to be working.
This was a super information-rich video and I’ll definitely have to check out the Datasette library. But when I think of a “personal data warehouse”, I see a lot more utility in centralizing financial data (cash, assets, debts, receipts), health records, government documents, and other more... non-trivial information. The examples of “plot your dog’s weight over time by filtering tweets” and “rank pictures you’ve taken of pelicans by ML-graded aesthetic quality” are delightful but absurdly frivolous (and I honestly can’t tell whether they were meant as serious examples or not..).
What I want this to be is a kind of mint.com, but with the ability to ingest from a wider swath of sources, and without having to grant a third party app access to my bank account. I’m sure many companies have tried to make all-encompassing life-tracker tools that I’m not aware of, but I’m equally certain that they have serious security and privacy problems. And it’s entirely possible that this is all just wishful thinking... that centralizing sensitive data in this way will always create a single point of failure that is inherently risky.
The examples I use are meant to be frivolous and amusing, but more importantly they're examples that it's safe to share in a demo. I've done some work on importing financial details into SQLite and Datasette (so far just using CSV exports from online banking) but they're not something I'm comfortable sharing in a video!
I should probably add a note about those to my presentations - explicitly call out that you can absolutely import financially sensitive data into a personal data warehouse like this, but it won't be something you'll want to show other people.
First of all, I appreciate the response! The examples were good - they demonstrated the flexibility and depth of the tool. I suppose I’m just lamenting the fact that I’d like to use this tool to create a data warehouse for sensitive financial info, but I don’t trust myself to ensure I wouldn’t be creating a massive risk of identity theft, given my level of security expertise.. again, loved the video.
Having a copy of your data "reclaims" it only in terms of access; it does not reclaim control over it.
Once you upload data into many of those services, in most cases you give them a permanent license to use it in whatever way they want.
Then, if you have your own website, nothing prevents Clearview AI [or some equivalent company] from crawling it and indexing your photos into their facial recognition db. I don't think those companies care at all about robots.txt.
That's absolutely true. The GDPR gives you a right to request deletion, but it's not a great defense against companies like Clearview harvesting your data for other purposes. Hopefully the legal framework will continue to get stronger around that.
In the meantime, we can still have a LOT of fun by pulling our data back into systems that let us run our own queries.
My data warehouse has always been a computer under my physical control with my data. TrueNAS 12 is awesome and hardware is a steal (ebay r720xd 128G with a bunch of disks)
You mentioned wanting to do it on mobile, I wonder if you could plug it into a little Beeware [1] app.
[1] Beeware - https://beeware.org/ - Write your apps in Python and release them on iOS, Android, Windows, MacOS, Linux, Web, and tvOS using rich, native user interfaces.
> But the really fun part is that it turns out any time you track an outdoor workout on your Apple Watch it records your exact location every few seconds
I'm a big fan of the idea, especially in the light of Google Photos announcement. I'd ideally want my phone to back up my pictures, location, health data, documents, and other data to something self-hosted, with at least decent mining/visualization/search built in, without ever sharing it with any third party.
I have a home nextcloud server for this (easy to run with docker!).
- The nextcloud phone clients upload our photos/videos automatically.
- My homedir videos, photos, and documents directories are all symlinked from my nextcloud directory so they sync automatically.
- My journal/notes are stored in Markdown, inside my documents directory. Synced per the above.
- nextcloud includes a calendar server, to which I sync my google and outlook calendars.
- I have face recognition run on my photos by a nextcloud plugin.
- nextcloud also includes online collaborative docs, and other capabilities that I don't use yet. Probably there are some for geo data.
The whole thing is searchable through the web, desktop, or phone clients. I can pick a day and call up all the photos, notes, calendar appointments, and files for it. Or search by person, or search names and contents of files... It's also got a WebDAV interface and a standardized API, share links with optional passwords, and it's multi-user friendly.
Oh, and of course it's the cheapest way to get terabytes of storage.
I'm a little wary of OwnCloud, and generally of any software written in PHP. PHP invites problems. That said, I find this page reassuring: https://owncloud.com/security/. It does seem like they took security pretty seriously a few years ago, and only a few issues have been discovered since then.
[0]: Building data liberation infrastructure — https://beepb00p.xyz/exports.html
[1]: Human Programming Interface — https://beepb00p.xyz/hpi.html
[2]: The sad state of personal data and infrastructure — https://beepb00p.xyz/sad-infra.html
[3]: Against unnecessary databases — https://beepb00p.xyz/unnecessary-db.html
[4]: https://news.ycombinator.com/item?id=21844105