I pull all of the data I can get my hands on (from Twitter, GitHub, Swarm, Apple Health, Pocket, Apple Photos and more) into SQLite database tables that match the schema of the system that they are imported from. I call this family of scripts Dogsheep - https://dogsheep.github.io/
Then I use SQL queries against those tables to extract and combine data in ways that are useful to me.
If the schema of the systems I am importing from changes, I can update my queries to compensate for the change.
This protects me from having to design a standard schema up front - I take whatever those systems give me. But it still lets me combine and search across all of the data from disparate systems essentially at runtime.
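As a minimal sketch of that import pattern (the table name, column names, and records here are hypothetical, not the actual Dogsheep schema): take a source system's JSON records and create a SQLite table whose columns mirror the record keys one-to-one, with no up-front modelling.

```python
import sqlite3

def import_records(db, table, records):
    """Create a table whose columns mirror the keys of the source
    records, then insert them as-is (the schema follows the source)."""
    cols = list(records[0].keys())
    col_defs = ", ".join(f'"{c}"' for c in cols)
    db.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in cols)
    db.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r.get(c) for c in cols) for r in records],
    )

# Hypothetical API response, e.g. a list of tweets:
tweets = [
    {"id": 1, "full_text": "Hello SQLite", "favorite_count": 3},
    {"id": 2, "full_text": "Dogsheep is fun", "favorite_count": 7},
]
db = sqlite3.connect(":memory:")
import_records(db, "tweets", tweets)

# Ordinary SQL now works against the imported data:
print(db.execute(
    "SELECT full_text FROM tweets WHERE favorite_count > 5"
).fetchall())  # → [('Dogsheep is fun',)]
```

If the upstream API adds or renames keys, only the queries downstream need to change, not the import step.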
This is also the direction a lot of data engineering is going: systems are moving from ETL to ELT (that is, loading data with minimal transformation as soon as possible and doing the complicated transforms afterwards). This is, for example, why tools such as dbt are becoming popular.
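The ELT shape can be sketched in a few lines (the tables and figures here are invented for illustration, and the view stands in for what a dbt model would materialize): land the raw data untouched, then express the transform as SQL on top of it.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# "EL" step: land the source data as-is, no up-front modelling.
db.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
db.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 400, "refunded"), (3, 999, "paid")],
)

# "T" step: the complicated transform lives downstream as plain SQL
# (essentially what a dbt model is - a SELECT that gets materialized later).
db.execute("""
    CREATE VIEW paid_orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
""")
print(db.execute("SELECT SUM(amount_dollars) FROM paid_orders").fetchone()[0])
```

Because the raw table is untouched, a change in the transform logic only requires redefining the view, never re-extracting from the source.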
But I suspect OP's point is probably that you do at least impose a schema on import rather than letting it just be featureless JSON blobs or EAV tuples, which makes the job of understanding that data later so much easier.
Interesting. I did this for medical records, mid-aughts.
The phrase I used was "transform & translate at the last possible moment". Ideally in the reporting step (server side rendering the HTML).
Since all of our customers kept changing their minds (totally their prerogative), my strategy minimized the cost of change and the turnaround time. Prior products had to "rerun the data" from scratch for schema changes; we'd make the same query tweaks, but at the end of the pipeline instead of the beginning. (The hardest part was always nailing down the requirements and expectations. Then the code basically wrote itself.)
This strategy made people's heads explode. Old schoolers just couldn't abide it, despite our small team running circles around everyone else. Data must be normalized! All the same! We had to be cheating somehow. They did everything possible to destroy us, and eventually succeeded.
Have you heard of Steampipe? It's a PostgreSQL foreign data wrapper that exposes APIs as virtual tables. It was recently discussed in an HN thread about Postgres FDWs: https://news.ycombinator.com/item?id=28227905
I feel like this is a pretty good strategy, since you can either query the API directly (which seems to be surprisingly effective) or drop down to materialized views if you need to cache the results for heavily repetitive queries.
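The "drop down to a cached table" pattern can be sketched without Postgres or Steampipe at all. Here SQLite stands in for the whole setup (SQLite has no `MATERIALIZED VIEW`, so `CREATE TABLE ... AS SELECT` plays that role; in Postgres you would use `CREATE MATERIALIZED VIEW` and `REFRESH` it), and the "live" table is ordinary data rather than an FDW-backed API call:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Stand-in for a live/virtual table (in Steampipe this would be an
# FDW table backed by API calls; here it is just ordinary rows).
db.execute("CREATE TABLE live_repos (name TEXT, stars INTEGER)")
db.executemany("INSERT INTO live_repos VALUES (?, ?)",
               [("datasette", 7000), ("sqlite-utils", 1200)])

# "Materialize" a heavily repeated query into a plain table so that
# subsequent reads hit the cache instead of the live source.
db.execute("""
    CREATE TABLE popular_repos AS
    SELECT name FROM live_repos WHERE stars > 5000
""")
print(db.execute("SELECT name FROM popular_repos").fetchall())  # → [('datasette',)]
```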
To be honest it's mostly for fun, and to help me dogfood https://datasette.io/ and come up with new features for it.
But just a moment ago I was trying to remember the name of the Diataxis documentation framework - I was sure I'd either tweeted about it or blogged it, so I ran a personal search in Dogsheep Beta and turned up this tweet: https://twitter.com/simonw/status/1386370167395942401
Someone asked me the other day who they should follow on Twitter for Python news, so I searched all 40,000 tweets I have favourited for "Python", faceted by user and told them the top four users.
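That kind of faceted query is a plain `GROUP BY` at heart. A sketch with an invented table (the column names are illustrative, not the actual Dogsheep schema): count matching favourites per user and take the top of the list.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical favourited-tweets table.
db.execute("CREATE TABLE favorited_tweets (user TEXT, full_text TEXT)")
db.executemany("INSERT INTO favorited_tweets VALUES (?, ?)", [
    ("alice", "New Python release notes"),
    ("alice", "Python packaging tips"),
    ("bob",   "Python typing deep dive"),
    ("carol", "Weekend hiking photos"),
])

# Facet by user: count matching tweets per user, most favourited first.
top = db.execute("""
    SELECT user, COUNT(*) AS n
    FROM favorited_tweets
    WHERE full_text LIKE '%Python%'
    GROUP BY user
    ORDER BY n DESC
    LIMIT 4
""").fetchall()
print(top)  # → [('alice', 2), ('bob', 1)]
```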
I consider schema-on-read to be the property of a system that retains the original response of a remote system call (mostly) unmodified, such that any derived values can be re-created to account for mistakes/omissions.
Doesn't necessarily mean that you have to project your schema dynamically from the original data every time, just that you have that option. If you're able to (mostly) re-create the original response and recover from any error, I think that's close enough to schema-on-read, though a pedant could disagree (floating-point precision is an easy counterexample).
I think of schema-on-write in the context of data from a remote system being copied to a local system such that mistakes/omissions require fetching data from the remote system to reconcile.
It sounds like you've also tackled dynamic schema migration from JSON graphs to SQLite relations, minimizing the relation count by keeping nested graph objects as their JSON string representation.
I think what I'm doing counts as schema-on-read then: the tables I am creating can be used to recreate the exact API response that I recorded, because there is a one-to-one mapping between the columns and the keys in that original JSON.
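That round-trip property can be demonstrated in a few lines (the `users` table and its values are invented for illustration): because each column maps one-to-one to a key in the original JSON, the recorded response can be reconstructed exactly from the row.

```python
import sqlite3

# A hypothetical recorded API response:
original = {"id": 42, "screen_name": "simonw", "followers_count": 100}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, screen_name TEXT, followers_count INTEGER)")
db.execute("INSERT INTO users VALUES (?, ?, ?)", tuple(original.values()))

# Columns map one-to-one to the JSON keys, so the original response
# can be rebuilt from the stored row.
db.row_factory = sqlite3.Row
row = db.execute("SELECT * FROM users").fetchone()
roundtrip = dict(row)
assert roundtrip == original
```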
You're using the database definition of "schema", which is orthogonal to the definition in the article. Take the example of the browser's bookmarks list: the browser stores it with a DB schema, yet an ever-growing list of bookmarks has no "conceptual/knowledge base" schema.
For my own personal Dogsheep (https://simonwillison.net/2020/Nov/14/personal-data-warehous...) that's 119 tables right now.
I even have a search engine for this, which is populated by SQL queries against the different source tables. You can see an example of how that works at https://github.com/simonw/datasette.io/blob/main/templates/d... - which powers the search interface at https://datasette.io/-/beta
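The populate-a-search-table-from-SQL idea can be sketched like this (the source tables and the `(type, title)` shape are invented for illustration, not the actual datasette.io schema): each source table gets its own `SELECT` that maps its schema onto a shared shape, and the results are unioned into one searchable table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Two illustrative source tables with different schemas.
db.execute("CREATE TABLE tweets (id INTEGER, full_text TEXT)")
db.execute("CREATE TABLE blog_posts (id INTEGER, title TEXT, body TEXT)")
db.execute("INSERT INTO tweets VALUES (1, 'Diataxis looks useful')")
db.execute("INSERT INTO blog_posts VALUES (1, 'Personal data warehouses', 'Notes on Dogsheep')")

# Populate a single search table from per-source SQL queries -
# each query maps a source schema onto the shared (type, title) shape.
db.execute("""
    CREATE TABLE search_index AS
    SELECT 'tweet' AS type, full_text AS title FROM tweets
    UNION ALL
    SELECT 'blog_post' AS type, title FROM blog_posts
""")
print(db.execute(
    "SELECT type FROM search_index WHERE title LIKE '%Diataxis%'"
).fetchall())  # → [('tweet',)]
```

Adding a new source system is then just adding one more `SELECT` arm to the union.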