I've come around to almost the opposite approach.

I pull all of the data I can get my hands on (from Twitter, GitHub, Swarm, Apple Health, Pocket, Apple Photos and more) into SQLite database tables that match the schema of the system that they are imported from. I call this family of scripts Dogsheep - https://dogsheep.github.io/
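
The import step is roughly this shape (a simplified sketch using sqlite-utils, not the actual Dogsheep code; the file and table names are made up):

    import json
    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")

    # Records exactly as the source API returned them,
    # e.g. a dump of the GitHub repos endpoint
    repos = json.load(open("github_repos.json"))

    # One column per JSON key; alter=True adds columns later
    # if the upstream schema grows new fields
    db["repos"].insert_all(repos, pk="id", alter=True, replace=True)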

For my own personal Dogsheep (https://simonwillison.net/2020/Nov/14/personal-data-warehous...) that's 119 tables right now.

Then I use SQL queries against those tables to extract and combine data in ways that are useful to me.
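
For example (with made-up table and column names), a query that joins across two imported systems:

    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")

    # Hypothetical join: Swarm checkins alongside photos taken the same day
    rows = db.query("""
        select checkins.created_at, checkins.venue_name, photos.path
        from checkins
        join photos on date(photos.taken_at) = date(checkins.created_at)
        order by checkins.created_at desc
        limit 20
    """)
    for row in rows:
        print(row)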

If the schema of the systems I am importing from changes, I can update my queries to compensate for the change.

This protects me from having to solve for a standard schema up front - I take whatever those systems give me. But it lets me combine and search across all of the data from disparate systems essentially at runtime.

I even have a search engine for this, which is populated by SQL queries against the different source tables. You can see an example of how that works at https://github.com/simonw/datasette.io/blob/main/templates/d... - which powers the search interface at https://datasette.io/-/beta
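
The pattern underneath (not the real dogsheep-beta configuration, just the shape of it, with illustrative table names) is a single search table populated by union-ing SQL queries over the source tables, with full-text search on top:

    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")

    # One row per searchable item, whatever system it came from
    db.execute("""
        create table if not exists search_index (
            type text, key text, title text, body text
        )
    """)
    db.execute("delete from search_index")
    db.execute("""
        insert into search_index (type, key, title, body)
        select 'tweet', id, '', full_text from tweets
        union all
        select 'release', id, name, body from releases
    """)
    db.conn.commit()

    # One-time setup: full-text index over the combined table
    db["search_index"].enable_fts(["title", "body"], create_triggers=True)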




This is also the direction a lot of data engineering is going, with systems moving from ETL to ELT (that is, loading data with minimal transformation as soon as possible and doing the complicated transforms after that). This is why tools such as dbt, for example, are getting popular.

But I suspect OP's point is that you do at least impose a schema on import, rather than letting it just be featureless JSON blobs or EAV tuples, which makes the job of understanding that data later so much easier.


+1 to both. "Store things in a schema that reflects the origin" is schema-on-write, just a very particular kind of it.


> systems are going from ETL to ELT

Interesting. I did this for medical records, mid-aughts.

The phrase I used was "transform & translate at the last possible moment". Ideally in the reporting step (server side rendering the HTML).

Since all of our customers kept changing their minds, totally their prerogative, my strategy minimized cost of change and turnaround time. Prior products would have to "rerun the data" from scratch, for schema changes. We'd make the same query tweaks, but at the end instead of the beginning of the pipeline. (Hardest part was always nailing down the requirements, expectations. Then the code basically wrote itself.)

This strategy made people's heads explode. Old schoolers just couldn't abide by it. Despite our small team running circles around everyone else. Data must be normalized! All the same! We had to be cheating somehow. They did everything possible to destroy us, and eventually succeeded.


I’ve had a lot of success having a Postgres table with a json blob and then a bunch of columns of interest, populated by a trigger.
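
Roughly this shape (hypothetical table and field names), with the trigger copying the interesting fields out of the blob on every write:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection
    with conn, conn.cursor() as cur:
        cur.execute("""
            create table if not exists events (
                id         bigserial primary key,
                raw        jsonb not null,
                event_type text,
                created_at timestamptz
            )
        """)
        # Trigger function pulls the columns of interest out of the jsonb blob
        cur.execute("""
            create or replace function events_extract() returns trigger as $$
            begin
                new.event_type := new.raw ->> 'type';
                new.created_at := (new.raw ->> 'created_at')::timestamptz;
                return new;
            end;
            $$ language plpgsql
        """)
        cur.execute("drop trigger if exists events_extract_trg on events")
        # "execute function" needs Postgres 11+; older versions use "execute procedure"
        cur.execute("""
            create trigger events_extract_trg
            before insert or update on events
            for each row execute function events_extract()
        """)
        # Inserts only supply the blob; the columns fill themselves in
        cur.execute(
            "insert into events (raw) values (%s::jsonb)",
            ('{"type": "order_created", "created_at": "2021-08-01T12:00:00Z"}',),
        )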


This works surprisingly well.

I end up just doing both - having one table store the blob, and another with the copied values (shopify_sales, shopify_sales_jsonb).

Databases are generally pretty good at handling large BLOBs/off-page heap values, but I prefer not to tempt the fates of buffer pool management.


Have you heard of Steampipe? It's a PostgreSQL foreign data wrapper that exposes APIs as virtual tables. It was recently discussed in an HN thread about pg FDWs: https://news.ycombinator.com/item?id=28227905

I feel like this is a pretty good strategy, since you can either query the API directly -- which seems to be surprisingly effective -- or drop down to materialized views if you need to cache the results for heavily repetitive queries.
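
Something like this, I think (assuming Steampipe's GitHub plugin and its default local Postgres endpoint; table/column names and connection details may differ for your setup):

    import psycopg2

    # Steampipe runs a local Postgres; plugins expose foreign tables
    conn = psycopg2.connect("host=localhost port=9193 dbname=steampipe user=steampipe")
    with conn, conn.cursor() as cur:
        # This hits the GitHub API through the FDW on every query...
        cur.execute("select name from github_my_repository order by name")
        print(cur.fetchall())

        # ...so cache heavily repeated queries in a materialized view
        cur.execute("""
            create materialized view if not exists my_repos as
            select name from github_my_repository
        """)
        # Re-run whenever you want fresh data in the cache
        cur.execute("refresh materialized view my_repos")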


Just wanna leave this [0] here: simonw's keynote on Datasette and Dogsheep at csv,conf,v6 was excellent and entertaining.

[0]: https://www.youtube.com/watch?v=UFn82w-97kI


Datasette looks really useful, the introduction [1] is well done!

1 - https://www.youtube.com/watch?v=7kDFBnXaw-c


You do realize that the person you're replying to, simonw here on HN, is Simon Willison, the creator of Datasette!


I do! I hadn't come across it yet, and it's always great to see a creator introduce his work well. Great work all around!


What are some useful ways you've used your own data?


To be honest it's mostly for fun, and to help me dogfood https://datasette.io/ and come up with new features for it.

But just a moment ago I was trying to remember the name of the Diataxis documentation framework - I was sure I'd either tweeted about it or blogged it, so I ran a personal search in Dogsheep Beta and turned up this tweet: https://twitter.com/simonw/status/1386370167395942401

Someone asked me the other day who they should follow on Twitter for Python news, so I searched all 40,000 tweets I have favourited for "Python", faceted by user and told them the top four users.
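
The query for that is roughly this (guessing at table/column names here; in practice the faceting happens in the Datasette UI):

    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")
    top_users = db.query("""
        select user, count(*) as n
        from favorited_tweets
        where full_text like '%python%'
        group by user
        order by n desc
        limit 4
    """)
    for row in top_users:
        print(row["user"], row["n"])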


This is very different from what the article is talking about. This falls 100% under “schema on read”


If you're taking a JSON API response structure and creating an exact 1-to-1 mapping to a SQL schema, you're doing schema on write.

If you store the JSON in a JSONB column, and then create projections as needed by parsing fields from the BLOB, then you have schema on read.
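
For contrast, the schema-on-read version keeps only the blob and projects a schema at query time, e.g. with a view (hypothetical names):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection
    with conn, conn.cursor() as cur:
        # Raw API responses, no columns decided up front
        cur.execute("""
            create table if not exists api_responses (
                id  bigserial primary key,
                raw jsonb not null
            )
        """)
        # The "schema" only exists when you read: a view parsing fields from the blob
        cur.execute("""
            create or replace view orders as
            select raw ->> 'id' as order_id,
                   (raw ->> 'total_price')::numeric as total_price,
                   (raw ->> 'created_at')::timestamptz as created_at
            from api_responses
        """)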


My code automatically creates a column for each key in the JSON object, adding any columns that are missing.

If a key represents a nested object, I dump JSON for that object into a text column (since SQLite doesn't have a separate JSON type).

What I'm doing doesn't feel different enough from "dump it all in a JSON column" to justify different terminology.

Here's an example of my results: https://github-to-sqlite.dogsheep.net/github


I consider schema-on-read to be the property of a system that retains the original response of a remote system call (mostly) unmodified, such that any derived values can be re-created to account for mistakes/omissions.

Doesn't necessarily mean that you have to project your schema dynamically every time from the original data, just that you have that option. If you're able to (mostly) re-create the original response and recover from any error, I think that's close enough to schema-on-read though the pedant could disagree (easy examples of fp precision come to mind).

I think of schema-on-write in the context of data from a remote system being copied to a local system such that mistakes/omissions require fetching data from the remote system to reconcile.

It sounds like you've also tackled dynamic schema migration from JSON graphs to SQLite relations, minimizing relation count by keeping nested graph objects as their JSON string representation.


That's a really clear explanation, thanks!

I think what I'm doing counts as schema-on-read then: the tables I am creating can be used to recreate the exact API response that I recorded, because there is a one-to-one mapping between the columns and the keys in that original JSON.


You’re using the database definition of “schema”, which is orthogonal to the definition in the article. Take the example of the browser’s bookmarks list: the browser is storing that with a DB schema, yet an ever-growing list of bookmarks has no “conceptual/knowledge base” schema.


Yeah, that's why the parent comment says they are doing the opposite from the article.



