The other day I spent a couple of hours trying to extract the data for my runs from Nike Run Club's servers. Nike has shut down its public API, but thankfully some GitHub repos offered working solutions.
As I browsed through ten years of my run data, I had the idea of creating software that lets you extract all (or most) of your online data and then search it using natural language queries. It is intended to help you piece together the puzzle of your digital footprint/history.
For example, by integrating with cloud storage (Dropbox/Drive), browser history (Firefox/Chrome), fitness apps (NRC/Strava), blogging/note-taking software (Obsidian/Bear), bookmark services (Pocket/Omnivore) and entertainment apps (YouTube/Spotify), you should be able to get answers to questions like:
* "What was my average running pace in July 2019? Show me all of my trail runs from that time."
* "Which recipes did I save in 2020?"
* "When did I start writing about TypeScript? Which tech blogs was I reading at that time?"
* "What videos was I watching between 2015 and 2017?"
* "Show me the artists I discovered on Spotify during the pandemic."
My intention is to build a local-first and privacy-first solution with a simple SQLite database. The user should be able to do whatever they want with the extracted data. A savvy user might build their own GUI, while a less savvy one might just want to archive their data on a personal storage server.
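To make the SQLite idea a bit more concrete, here is a rough sketch of how the "average running pace in July 2019" question could be answered once the data is local. Everything in it is an assumption for illustration: the `items` table, its columns, the sample rows and the `average_pace` helper are hypothetical, and the step that turns a natural language question into such a query is left out entirely.

```python
import sqlite3

# Hypothetical store: a real one would be a file on disk, not in memory.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one generic table for every extracted item.
conn.execute("""
    CREATE TABLE items (
        id           INTEGER PRIMARY KEY,
        source       TEXT NOT NULL,   -- e.g. 'nrc', 'spotify'
        kind         TEXT NOT NULL,   -- e.g. 'run', 'bookmark'
        occurred_at  TEXT NOT NULL,   -- ISO 8601 timestamp
        title        TEXT,
        distance_km  REAL,
        duration_min REAL
    )
""")

# Made-up sample rows, only here so the query has something to work on.
sample_runs = [
    ("nrc", "run", "2019-07-04T07:00:00", "Trail run", 8.0, 48.0),
    ("nrc", "run", "2019-07-20T07:00:00", "Road run", 10.0, 50.0),
    ("nrc", "run", "2019-08-02T07:00:00", "Road run", 5.0, 30.0),
]
conn.executemany(
    "INSERT INTO items (source, kind, occurred_at, title, distance_km, duration_min) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    sample_runs,
)


def average_pace(conn, month):
    """Return the average pace in min/km for runs in a 'YYYY-MM' month.

    Computed as total duration divided by total distance; returns None
    when there are no runs in that month.
    """
    row = conn.execute(
        "SELECT SUM(duration_min) / SUM(distance_km) FROM items "
        "WHERE kind = 'run' AND substr(occurred_at, 1, 7) = ?",
        (month,),
    ).fetchone()
    return row[0]


pace_july_2019 = average_pace(conn, "2019-07")
```

Because the result is just a SQLite database, anyone who prefers their own tooling could skip the natural language part and query the table directly.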
Does this sound like a good idea? Is it something that you would want to try? Do similar solutions already exist?
P.S. When I told my brother about it, he called it "the god app", which I thought was pretty funny and accurate.
Do I think I'll see it in my lifetime? Not a shot. The odds of someone gaining access to all the data in all the services I use or have used have to be close to nil. Companies aren't going to give up their user data; that's their bread and butter. Each service will need its own custom extractor, which will likely have to be rewritten again and again in a never-ending game of cat and mouse. And that's assuming you don't get sued for accessing their systems to extract the data they hold on you.
Then there's the storage required if you do manage to get all the data. I requested my data from Apple a few years ago, and it was something like 10GB of information. I assume shopping, social media, fitness, vehicle, mapping, etc. services have similar amounts of data on me. I wouldn't be surprised if the average digital identity has 1TB+ of data associated with it. Then you have to normalize all the data. Each service is going to have its data in its own format, with its own nuances, which will make it a huge pain to get everything into a single searchable format.
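To illustrate the normalization problem, here is a minimal sketch of what mapping two differently shaped exports onto one common record might look like. The field names (`start_epoch_ms`, `endTime`, `artist`, `track`), the `Record` type and the per-source functions are all made up for the example and do not reflect any real service's export format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Record:
    """Hypothetical common shape that every source gets mapped onto."""
    source: str
    kind: str
    occurred_at: datetime
    title: str


def from_fitness(raw: dict) -> Record:
    # Assumed export shape: start time as epoch milliseconds.
    ts = datetime.fromtimestamp(raw["start_epoch_ms"] / 1000, tz=timezone.utc)
    return Record("fitness", "run", ts, raw["name"])


def from_music(raw: dict) -> Record:
    # Assumed export shape: ISO 8601 string with a trailing 'Z'.
    ts = datetime.fromisoformat(raw["endTime"].replace("Z", "+00:00"))
    return Record("music", "listen", ts, f'{raw["artist"]} - {raw["track"]}')


# One normalizer per source; every new service means another entry here.
NORMALIZERS = {"fitness": from_fitness, "music": from_music}


def normalize(source: str, raw: dict) -> Record:
    """Dispatch a raw export item to the normalizer for its source."""
    return NORMALIZERS[source](raw)


run = normalize("fitness", {"start_epoch_ms": 1562223600000, "name": "Trail run"})
listen = normalize(
    "music",
    {"endTime": "2020-04-01T18:30:00Z", "artist": "Some Artist", "track": "Some Track"},
)
```

Even in this toy version, each source needs its own timestamp handling and field mapping, and that is with only two services and four fields.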