I've come around to almost the opposite approach.

I pull all of the data I can get my hands on (from Twitter, GitHub, Swarm, Apple Health, Pocket, Apple Photos and more) into SQLite database tables that match the schema of the system that they are imported from. I call this family of scripts Dogsheep - https://dogsheep.github.io/
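
The import step is roughly this shape (a simplified sketch using sqlite-utils, not the actual Dogsheep code; the file and table names are made up):

    import json
    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")

    # Records exactly as the source API returned them,
    # e.g. a dump of the GitHub repos endpoint
    repos = json.load(open("github_repos.json"))

    # One column per JSON key; alter=True adds columns later
    # if the upstream schema grows new fields
    db["repos"].insert_all(repos, pk="id", alter=True, replace=True)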

For my own personal Dogsheep (https://simonwillison.net/2020/Nov/14/personal-data-warehous...) that's 119 tables right now.

Then I use SQL queries against those tables to extract and combine data in ways that are useful to me.
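
For example (with made-up table and column names), a query that joins across two imported systems:

    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")

    # Hypothetical join: Swarm checkins alongside photos taken the same day
    rows = db.query("""
        select checkins.created_at, checkins.venue_name, photos.path
        from checkins
        join photos on date(photos.taken_at) = date(checkins.created_at)
        order by checkins.created_at desc
        limit 20
    """)
    for row in rows:
        print(row)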

If the schema of the systems I am importing from changes, I can update my queries to compensate for the change.

This protects me from having to solve for a standard schema up front - I take whatever those systems give me. But it lets me combine and search across all of the data from disparate systems essentially at runtime.

I even have a search engine for this, which is populated by SQL queries against the different source tables. You can see an example of how that works at https://github.com/simonw/datasette.io/blob/main/templates/d... - which powers the search interface at https://datasette.io/-/beta
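
The pattern underneath (not the real dogsheep-beta configuration, just the shape of it, with illustrative table names) is a single search table populated by union-ing SQL queries over the source tables, with full-text search on top:

    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")

    # One row per searchable item, whatever system it came from
    db.execute("""
        create table if not exists search_index (
            type text, key text, title text, body text
        )
    """)
    db.execute("delete from search_index")
    db.execute("""
        insert into search_index (type, key, title, body)
        select 'tweet', id, '', full_text from tweets
        union all
        select 'release', id, name, body from releases
    """)
    db.conn.commit()

    # One-time setup: full-text index over the combined table
    db["search_index"].enable_fts(["title", "body"], create_triggers=True)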




This is also the direction a lot of data engineering is going, with systems moving from ETL to ELT (that is, loading data with minimal transformation as soon as possible and doing the complicated transforms after that). This is why tools such as dbt, for example, are getting popular.

But I suspect OP's point is that you do at least impose a schema on import, rather than letting it just be featureless JSON blobs or EAV tuples, which makes the job of understanding that data later so much easier.


+1 to both. "Store things in a schema that reflects the origin" is schema-on-write, just a very particular kind of it.


> systems are going from ETL to ELT

Interesting. I did this for medical records, mid-aughts.

The phrase I used was "transform & translate at the last possible moment". Ideally in the reporting step (server side rendering the HTML).

Since all of our customers kept changing their minds, totally their prerogative, my strategy minimized cost of change and turnaround time. Prior products would have to "rerun the data" from scratch, for schema changes. We'd make the same query tweaks, but at the end instead of the beginning of the pipeline. (Hardest part was always nailing down the requirements, expectations. Then the code basically wrote itself.)

This strategy made people's heads explode. Old schoolers just couldn't abide by it. Despite our small team running circles around everyone else. Data must be normalized! All the same! We had to be cheating somehow. They did everything possible to destroy us, and eventually succeeded.


I’ve had a lot of success having a Postgres table with a json blob and then a bunch of columns of interest, populated by a trigger.
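
Roughly this shape (hypothetical table and field names), with the trigger copying the interesting fields out of the blob on every write:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection
    with conn, conn.cursor() as cur:
        cur.execute("""
            create table if not exists events (
                id         bigserial primary key,
                raw        jsonb not null,
                event_type text,
                created_at timestamptz
            )
        """)
        # Trigger function pulls the columns of interest out of the jsonb blob
        cur.execute("""
            create or replace function events_extract() returns trigger as $$
            begin
                new.event_type := new.raw ->> 'type';
                new.created_at := (new.raw ->> 'created_at')::timestamptz;
                return new;
            end;
            $$ language plpgsql
        """)
        cur.execute("drop trigger if exists events_extract_trg on events")
        # "execute function" needs Postgres 11+; older versions use "execute procedure"
        cur.execute("""
            create trigger events_extract_trg
            before insert or update on events
            for each row execute function events_extract()
        """)
        # Inserts only supply the blob; the columns fill themselves in
        cur.execute(
            "insert into events (raw) values (%s::jsonb)",
            ('{"type": "order_created", "created_at": "2021-08-01T12:00:00Z"}',),
        )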


This works surprisingly well.

I end up just doing both - having one table store the blob, and another with the copied values (shopify_sales, shopify_sales_jsonb).

Databases are generally pretty good at handling large BLOBs/off-page heap values, but I prefer not to tempt the fates of buffer pool management.


Have you heard of Steampipe? It's a PostgreSQL foreign data wrapper that exposes APIs as virtual tables. It was recently discussed in an HN thread about pg FDWs: https://news.ycombinator.com/item?id=28227905

I feel like this is a pretty good strategy, since you can either query the API directly -- which seems to be surprisingly effective -- or drop down to materialized views if you need to cache the results for heavily repetitive queries.
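
Something like this, I think (assuming Steampipe's GitHub plugin and its default local Postgres endpoint; table/column names and connection details may differ for your setup):

    import psycopg2

    # Steampipe runs a local Postgres; plugins expose foreign tables
    conn = psycopg2.connect("host=localhost port=9193 dbname=steampipe user=steampipe")
    with conn, conn.cursor() as cur:
        # This hits the GitHub API through the FDW on every query...
        cur.execute("select name from github_my_repository order by name")
        print(cur.fetchall())

        # ...so cache heavily repeated queries in a materialized view
        cur.execute("""
            create materialized view if not exists my_repos as
            select name from github_my_repository
        """)
        # Re-run whenever you want fresh data in the cache
        cur.execute("refresh materialized view my_repos")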


Just wanna leave this [0] here: simonw's keynote on Datasette and Dogsheep at csv,conf,v6 was excellent and entertaining.

[0]: https://www.youtube.com/watch?v=UFn82w-97kI


Datasette looks really useful, the introduction [1] is well done!

1 - https://www.youtube.com/watch?v=7kDFBnXaw-c


You do realize that the person you're replying to, simonw here on HN, is Simon Willison, the creator of Datasette!


I do! I hadn't come across it yet, and it's always great to see a creator introduce his work well. Great work all around!


What are some useful ways you've used your own data?


To be honest it's mostly for fun, and to help me dogfood https://datasette.io/ and come up with new features for it.

But just a moment ago I was trying to remember the name of the Diataxis documentation framework - I was sure I'd either tweeted about it or blogged it, so I ran a personal search in Dogsheep Beta and turned up this tweet: https://twitter.com/simonw/status/1386370167395942401

Someone asked me the other day who they should follow on Twitter for Python news, so I searched all 40,000 tweets I have favourited for "Python", faceted by user and told them the top four users.
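
The query for that is roughly this (guessing at table/column names here; in practice the faceting happens in the Datasette UI):

    import sqlite_utils

    db = sqlite_utils.Database("dogsheep.db")
    top_users = db.query("""
        select user, count(*) as n
        from favorited_tweets
        where full_text like '%python%'
        group by user
        order by n desc
        limit 4
    """)
    for row in top_users:
        print(row["user"], row["n"])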


This is very different from what the article is talking about. This falls 100% under “schema on read”


If you're taking a JSON API response structure and creating an exact 1-to-1 mapping to a SQL schema, you're doing schema on write.

If you store the JSON in a JSONB column, and then create projections as needed by parsing fields from the BLOB, then you have schema on read.
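
For contrast, the schema-on-read version keeps only the blob and projects a schema at query time, e.g. with a view (hypothetical names):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection
    with conn, conn.cursor() as cur:
        # Raw API responses, no columns decided up front
        cur.execute("""
            create table if not exists api_responses (
                id  bigserial primary key,
                raw jsonb not null
            )
        """)
        # The "schema" only exists when you read: a view parsing fields from the blob
        cur.execute("""
            create or replace view orders as
            select raw ->> 'id' as order_id,
                   (raw ->> 'total_price')::numeric as total_price,
                   (raw ->> 'created_at')::timestamptz as created_at
            from api_responses
        """)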


My code automatically creates a column for each key in the JSON object, adding any columns that are missing.

If a key represents a nested object, I dump JSON for that object into a text column (since SQLite doesn't have a separate JSON type).

What I'm doing doesn't feel different enough from "dump it all in a JSON column" to justify different terminology.

Here's an example of my results: https://github-to-sqlite.dogsheep.net/github


I consider schema-on-read to be the property of a system that retains the original response of a remote system call (mostly) unmodified, such that any derived values can be re-created to account for mistakes/omissions.

Doesn't necessarily mean that you have to project your schema dynamically every time from the original data, just that you have that option. If you're able to (mostly) re-create the original response and recover from any error, I think that's close enough to schema-on-read though the pedant could disagree (easy examples of fp precision come to mind).

I think of schema-on-write in the context of data from a remote system being copied to a local system such that mistakes/omissions require fetching data from the remote system to reconcile.

It sounds like you've also tackled dynamic schema migration from JSON graphs to SQLite relations, minimizing relation count by keeping nested graph objects as their JSON string representation.


That's a really clear explanation, thanks!

I think what I'm doing counts as schema-on-read then: the tables I am creating can be used to recreate the exact API response that I recorded, because there is a one-to-one mapping between the columns and the keys in that original JSON.


You’re using the database definition of “schema”, which is orthogonal to the definition in the article. Take the example of the browser’s bookmarks list: the browser is storing that with a DB schema, yet an ever-growing list of bookmarks has no “conceptual/knowledge base” schema.


Yeah, that's why the parent comment says they are doing the opposite from the article.



