But if you are going to be querying on a particular, known-ahead-of-time value in the JSON, why did you use JSON to begin with? It's not like the data came through some other system, since this is SQLite. Couldn't you have just extracted the pieces you cared about ahead of time?
Adding bidirectional lenses to “normalize” and “denormalize” data before it enters and after it exits a SQL database means you need to maintain those lenses separately from the SQL database schema. You would have three things to keep track of - the upstream data source (the thing supplying the JSON, might not be under your control), the lenses in your SQL client source code, and the SQL database schema.
Using virtual columns like this encodes the “normalizing” lens into the SQL schema in a purely-functional way. This seems like a pure win to me for clients that need to cache data but never update it. The client code can be a “dumb pipe” on the write side and just dump whatever data it receives into the SQL database. In essence, you are getting CQRS (command query responsibility segregation) for “free”, built into the storage system. Why spend the engineering cycles in the SQL client?
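To make that concrete, a minimal sketch (table, column, and property names are invented, roughly the shape the article describes):

```sql
-- Write side stays a dumb pipe: just dump the received JSON.
CREATE TABLE records (
  id   TEXT PRIMARY KEY,
  data TEXT NOT NULL                -- raw JSON payload as received
);
INSERT INTO records (id, data) VALUES ('r1', '{"kind":"page","title":"Hi"}');

-- Read-side "lens", encoded in the schema rather than in client code.
ALTER TABLE records ADD COLUMN kind TEXT
  GENERATED ALWAYS AS (json_extract(data, '$.kind')) VIRTUAL;
CREATE INDEX records_kind ON records (kind);

SELECT id FROM records WHERE kind = 'page';
```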
For a real world example - at Notion we have a SQLite database running in our client apps. Most of the rows come from the back-end API and serve as a local cache. There are some object properties the clients need to query, and other object properties the clients don’t care to query, but which are needed to render the data in the UI, so they still need to be stored locally. Over time, those needs change — as does the shape of the upstream API data source. So, we put un-queried object properties into catch-all JSON columns.
When we introduce a new query pattern to the client, we can run a number of different migration strategies to extract existing data from the JSON column and put it into a newly created column, or use a virtual column/index. Having the catch-all JSON column also means we can add an object property in our backend for web clients, and then later when we roll the feature out to native apps the data is already there - it just needs a migration to make it efficient to query.
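For instance, the "promote it to a real column" flavor of that migration might look roughly like this (hypothetical property name, reusing the made-up `records`/`data` names from the sketch above):

```sql
-- New query pattern arrived: pull the property out of the JSON blob.
ALTER TABLE records ADD COLUMN team_id TEXT;

-- Backfill rows that were cached before this migration existed.
UPDATE records SET team_id = json_extract(data, '$.team_id');

-- Make the new access pattern cheap to query.
CREATE INDEX records_team_id ON records (team_id);

-- (Unlike the virtual-column variant, new writes now need to populate
-- team_id explicitly as well.)
```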
In Postgres there is a family of INSTEAD OF triggers that allow you to do all this lensing within the database so that the application can have multiple views of the data that all resolve to a single normalized structure internally but present in a variety of ways. To my googling knowledge SQLite also supports INSTEAD OF triggers - is there a benefit to using the predefined JSON columns instead of writing the data into an actual normalized structure and then zipping it up at read time or during the regeneration of a materialized view?
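(For concreteness, the kind of thing I mean, with made-up names: a normalized pair of tables behind a denormalized view, with an `INSTEAD OF` trigger lensing writes into them.)

```sql
-- Normalized storage.
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emails (person_id INTEGER REFERENCES people(id), email TEXT);

-- Denormalized presentation for the application.
CREATE VIEW person_with_email AS
  SELECT p.id, p.name, e.email
  FROM people p LEFT JOIN emails e ON e.person_id = p.id;

-- Writes against the view are redirected into the normalized tables.
CREATE TRIGGER person_with_email_ins
INSTEAD OF INSERT ON person_with_email
BEGIN
  INSERT OR IGNORE INTO people (id, name) VALUES (NEW.id, NEW.name);
  INSERT INTO emails (person_id, email) VALUES (NEW.id, NEW.email);
END;

INSERT INTO person_with_email (id, name, email) VALUES (1, 'Ada', 'ada@example.com');
```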
Using `INSTEAD OF` or writing to a view implies that the SQL schema was designed with foreknowledge of the data that will be stored in the schema. These tools let you lens from one known, totally-specified schema (a "view") to another known schema (a "table") at write time. In contrast, the catch-all JSON column strategy allows us to store data the schema has no idea about -- or may never need to understand.
You're right that once we move a property from the "don't query, just store" category to the "query and/or update locally" category, we could use views/INSTEAD OF triggers to keep the JSON column and the normalized view in sync.
In the example given in the article foreknowledge of the property does exist - there's a specific DDL statement to run before the column-y thing is usable.
With the advent of schemaless I think there's been far too strong of a reactionary movement against schema adjustment. With some trivial investment in migration management infrastructure you can arrive at a point where changing the DB (even in drastic ways) is as easy as committing a new file to the code base. At my company we do make use of Postgres JSON types for some limited fields related to logging and auditing where live queries never need to be executed - where the data is essentially serving as WOM (write-only memory). We've also got one instance where our DB is essentially a pass-thru between two services that don't talk to each other directly; for that usage JSON works great since those two services just coordinate on the expected format. In most other cases we'll just write quick migrations to expand our data definition as needed to fit whatever we need to put into it.
Relational databases are both powerful and quite flexible - so long as you don't enshrine the schema to an extent where it feels immutable.
It’s easier to evolve a schema to be strictly normalized when you have only a few (<1000) SQL databases running in an environment you control, especially if you also control the deployment of code that talks to those databases. In that kind of setting you can easily manage any kind of migration, and you know when a migration is fully rolled out and you can drop backwards compatibility.
In Notion’s SQLite use-case, we have many millions of instances of this schema running on hardware owned by our users, and those users ultimately control when our app updates & runs migrations on those devices. If I commit a new migration today, it’ll start rolling out tomorrow, but it will take weeks to saturate the install base of our app, and we can never be 100% sure all existing clients have upgraded.
Having JSON catch-all columns means we can evolve our API data types in parallel with that slow SQL schema rollout, instead of needing a strict order of “migrate first, then start writing” for any new bit of data we want to render & cache on the client.
I think that for Notion (neat company btw) it might make a lot of sense to keep the schema pretty loose, but I have managed to keep an agile DB alive and well with a lot of flexibility. A lot of migration infrastructure packages and tools will store the migration state in the DB itself - we're using Phinx[1] internally, which creates a `phinxlog` table with a record of migrations that have been run. There is tooling to migrate only up to a specific version, and since the record of all executed versions is stored in the DB the tool can easily figure out what work needs to be done in which circumstances. The result is that we can easily roll different environments onto different versions.
Neither schema driven nor unstructured is always the right call - they both have their strengths and weaknesses - but I think that trust in data integrity is pretty important when writing flexible code on top of a data layer. Knowing that expected keys can't be omitted and that so-and-so column must conform to a given data domain can really alleviate defensive coding costs.
If you'd like to talk some more I can shoot you an email and we can sit down sometime?
If it is a cache, is it really beneficial to store that data (wasting space) rather than query it again after migration (wasting bandwidth)?
If a lot of my cached data is never used again and waiting to be replaced by e.g. LRU strategy, storing all those fields that can't even be used in the current version seems like a huge waste of space.
It seems that it would only be a good idea if most of the fields of the JSON objects are extracted and used, and most of the cached objects are used again. Which might be the case at Notion (where the cached data is probably pages) but probably not universally.
> There are some object properties the clients need to query, and other object properties the clients don’t care to query, but which are needed to render the data in the UI, so they still need to be stored locally.
For example, today we run queries on "parent page ID", so that is a normalized parent_id column in SQLite, instead of being stored in the JSON column.
We don't currently run queries on "text color" or "image width" object properties, but we still need to store that data in the cache to render a block. So, it's fine to leave those things in a JSON column. If we need to run a query like "find all images that are full-width" in the future, we can add a new image_width column/virtual column/index at that time.
You can choose whether to store the values or to compute them every time in SQLite.
But in terms of the overall performance budget, for a mobile app, I don't think the choice matters much. JSON in SQLite is fast and even in the case of "storing every generated column with a lot of user data", I imagine the app binary is going to be larger than the database it produces.
In my case, I have a bunch of data in the JSON, but found users need to filter requests based on just one field of the data.
Specifically, I'm loading Postfix logs into the database: I parse the logs into JSON and then store that JSON in a column. The format of that JSON comes in ~4-5 flavors, depending on the exact log event, but most of them include a "to" field (the envelope recipient) that the API users need to filter on. I ended up adding a "to" column and mirroring the JSON "to" field into it, but this approach would have worked as well.
The API users still have access to that additional rich data, and the whole thing overall is simpler because I don't have it exploded out into a bunch of fields, many of which would be NULL for the various log types.
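(The generated-column version I was referring to would be something like this; table and column names are just for illustration, not my actual schema:)

```sql
-- Each row stores the parsed log event as JSON; the recipient is surfaced
-- as a virtual column ("to" is a SQL keyword, hence the rcpt name).
CREATE TABLE postfix_log (
  id    INTEGER PRIMARY KEY,
  event TEXT NOT NULL,
  rcpt  TEXT GENERATED ALWAYS AS (json_extract(event, '$.to')) VIRTUAL
);
CREATE INDEX postfix_log_rcpt ON postfix_log (rcpt);

-- Log flavors without a "to" field simply yield NULL and never match.
SELECT id, event FROM postfix_log WHERE rcpt = 'user@example.com';
```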
What if you know a few fields, like some "id" or "name" at the top level, but the rest of the JSON is highly nested, tree-like?
Instead of mapping it onto relational tables and keeping indexes, and then running a complex query, you can just put them all into a ready-made, well, blob.
Note that it only makes sense to have a relational column if you need to (1) use it as a query condition or (2) update it independently of others. In many tables, many columns are neither, and can be safely stored as JSON.
Exactly the point I was going to make. It's also relatively trivial to generate view statements from the schema of the object if required for analytics (not fast though).
Much better than a crazy ORM rat's nest, in my experience.
One idea I have been considering - use SQLite as my JSON parser in Lua. Lua has solid SQLite bindings, but there are several JSON libraries with varying performance characteristics. If I create a :memory: database, I could shunt JSON in/out of SQLite and use its performant implementation to extract fields from APIs.
Admittedly, not a standard use case, but something I intend to try in the near future.
Be careful with that. Since you'll end up coercing JSON types into SQLite types doing this, you might run into some odd edge cases. One big one is around big numbers:
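For instance (illustrative; JSON integers that overflow SQLite's 64-bit INTEGER silently come back as floating point):

```sql
-- The JSON integer below exceeds SQLite's 64-bit INTEGER range,
-- so json_extract hands back a REAL and the low digits are gone.
SELECT typeof(json_extract('{"id": 123456789012345678901}', '$.id')),
       json_extract('{"id": 123456789012345678901}', '$.id');
-- typeof is 'real'; the value is only an approximation of the original id.
```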
It may never be a problem for you, but it's been a pain point for me dealing with different JSON libraries in the past when people stuff giant numbers into something because their end has no problem with it.
It seems more like a hack than a good idea. If you have to run with only the Lua standard library, and that's the only (performant) JSON parser in the library, do what you have to do; but otherwise it seems like a bad idea.
> Couldn't you have just extracted the pieces you cared about ahead of time?
Sure, but if you discover this desire later, you now have historical data to change and application behavior to change and database structure to change. And repeat all that when you discover the next piece of more-useful-than-anticipated data.
Or you can just `alter table` and be done with it.
The two things are actually independent of each other: you can create a virtual column and an index, or do only one. The generated column can be used as an alias of an expression in queries (e.g. ORDER BY clauses), while just creating an index cannot. The generated column can be optionally virtual (although that defeats the purpose listed in this article), while the index is obviously always stored. The generated column reduces the likelihood that you accidentally don't use the index, because "the SQLite query planner will consider using an index on an expression when the expression that is indexed appears... exactly as it is written in the CREATE INDEX statement." [0]
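Roughly, for a hypothetical `docs` table with a JSON `data` column:

```sql
-- Option 1: index the expression directly.
CREATE INDEX docs_kind_expr ON docs (json_extract(data, '$.kind'));
-- The planner considers it only when the query spells the expression
-- exactly as it appears in the CREATE INDEX statement:
SELECT id FROM docs WHERE json_extract(data, '$.kind') = 'image';

-- Option 2: a generated (here virtual) column, indexed separately.
ALTER TABLE docs ADD COLUMN kind TEXT
  GENERATED ALWAYS AS (json_extract(data, '$.kind')) VIRTUAL;
CREATE INDEX docs_kind_col ON docs (kind);
-- The column name now works as an alias wherever an expression would:
SELECT id FROM docs WHERE kind = 'image' ORDER BY kind;
```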
I think the way the article does it just provides an easy, symbolic way to reference the computed value... so it gives a little readability, and it's maybe easier to ensure your WHERE clause and index match up.
Also, depending on your access patterns, you might want the generated column to be stored (evaluated on write) rather than virtual (evaluated on read).
SQLite doesn't support adding generated stored columns to a table once the table has been created, so you basically have to decide upfront which generated columns must be stored, and then you can't touch it anymore (short of dropping it).
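In other words (made-up names), the STORED flavor has to be in the original CREATE TABLE, while ALTER TABLE can only add the VIRTUAL flavor later:

```sql
-- STORED must be declared when the table is created (or via a table rebuild).
CREATE TABLE events (
  id      INTEGER PRIMARY KEY,
  payload TEXT NOT NULL,
  kind    TEXT GENERATED ALWAYS AS (json_extract(payload, '$.kind')) STORED
);

-- Later additions are limited to VIRTUAL generated columns.
ALTER TABLE events ADD COLUMN source TEXT
  GENERATED ALWAYS AS (json_extract(payload, '$.source')) VIRTUAL;
```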
That doesn't matter. The json_extract function returns a string, so json_extract('[{"foo": "x"}]', '$.foo') is NULL and json_extract('[{"foo": "x"}]', '$[0].foo') is 'x' and it doesn't matter if this is destined to go into a generated column or an index.
[append] And SQLite's json_extract doesn't have anything like '$[*].foo' or any mapping functions. You could use json_each to do that, but the process is very different from creating an index.
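For example (illustrative values):

```sql
-- json_extract addresses a single element:
SELECT json_extract('[{"foo":"x"},{"foo":"y"}]', '$[1].foo');   -- 'y'

-- Mapping over every element means going through json_each instead:
SELECT json_extract(value, '$.foo')
FROM json_each('[{"foo":"x"},{"foo":"y"}]');                    -- 'x', 'y'
```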
I think the advantage for simple objects is just that you can refer to the columns directly vs having to go through a function. It's only really a big win if you're manually writing a lot of sql or if you're using something like Rails and you want to write queries against values in the JSON blob without going through a bunch of hoops.
Concise and to the point, I’ll keep this around for experimenting with JSON in Sqlite later. I tend to use JSON files for small applications but if using JSON works as well as reported in Sqlite that will be my new go-to even for tiny experiments. I’ve used Sqlite before but I imagined that it would not work so well with JSON.
I've been using sqlite w/ JSON columns for doing a series of web scraping projects at work. For each site/service scraped, the general flow is to capture a set of key variables then stash other data in one or more JSON columns. This way I can easily get data my boss is looking for in the short term using the non-JSON columns, then can (usually) easily pull additional information using queries on the JSON columns so I don't need to rescrape the same pages. Then I can create new columns and backfill them using queries against the JSON columns if need be. It's been working amazing for me so far.

I will say getting used to the json1 query syntax can be a little confusing at first since it doesn't quite feel like SQL syntax nor dictionary/object sort of syntax -- it's more like jq. So be prepared to sit with the json1 docs + stackoverflow for a little while. But once you get that under your belt, I think you'll be impressed with how quickly you can move with this approach.

I also used to use little JSON/CSV files for small projects, but after getting comfortable with sqlite + json1 and having many experiences where I end up exporting that data to sqlite anyway, I just go straight to storing everything in sqlite and using JSON columns when I don't have time to design a proper relational database and/or when I want to save JSON to disk in an easily queryable and modifiable fashion.

If you're on the Python side of things, check out peewee. It's become my go-to ORM and has excellent support for sqlite extensions like json1. You define a variable pointing to your sqlite file, a class for each table, have peewee create the tables, and you're all set.
It works really well. I keep a couple "master" tables and do a bunch of views off those main tables to do ETL reporting for one of my projects. It is performant, and the JSON support has been robust enough for everything I've tried to do.
> Virtual columns are defined in SQL:2003 as Generated Columns, and are only implemented by some DBMSs, like MariaDB, SQL Server, Oracle, PostgreSQL, SQLite and Firebird.
> The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”
and it's too slow for all others. I learned a lot from that interactive SQLite thing, and I'm glad I didn't just let myself believe that 500ms was ok. (I'm not saying that you should just default to a "slow is ok" stance, I'm just saying I'm glad I didn't. This is the internet, where everyone misunderstands me and pretends I say things I don't, so I need to be explicit sometimes.)
“Don’t add a label for something until you know you need it! Use filter expressions and brute force those logs. It works – and it’s fast.”
Now I don’t advocate brute forcing text as a replacement for a database. But for filing cabinets, like logs with very unpredictable formats, that some admin will look at every once in a blue moon, those 500ms extra per query are nothing. Put it in relation to someone doing it completely manually.
If you can already subset the logs by date and component, scanning the few thousands entries of interest is probably cheaper than maintaining indexes on many fields for all kind of imagined queries. The average-case algorithmic complexity doesn't matter if you can keep N low, even if you run the algorithm a lot.
The "this'll work for most use cases" approach to solving problems is great until it isn't, but you most likely have a "good problem" on your hands if you hit that point.
Yeah, using expression indexes or generated columns like this is a tradeoff - slower writes for faster queries & more declarative (de)normalization. The same could be said about any index, though… why is this index strategy more concerning? All indexing trades off write speed for read speed.
> You decide to store the raw JSON, as normalization is non-trivial.
You lost me. If you have to build the index, the normalization is already trivial. This is always a sloppy hack, so this advice is just a big "temporary workaround" until you normalize the hot paths.
Take various data collection analysis tasks as an example - there are many cases where you may not have a good idea of how everything fits together until you can run queries on the data, with JSON being a very common format. You may need to combine many JSON datasets with different layouts and potentially deeply nested structures, as well as more traditional tabular data. Being able to query all of this by putting everything into a temporary sqlite file is incredibly powerful. Sure, you're moving much more into SQL territory once you're creating indexes on JSON data, as you say, but I'm not sure that always means "sloppy hack". For data tasks at least, being able to combine the power of relational + noSQL databases is incredibly productive. Doesn't mean I would necessarily put something like this into production, but for figuring stuff out, combining the two approaches has a lot of merit in my experience.
Normalizing from raw json isn't terribly complicated, but it's definitely more complicated than just indexing some paths into it. If the json is loosely defined, normalization might not be practical (i.e. requiring too many tables, or needing to be revisited every time the json format evolves, etc.).
The example I like to give is different payment types... CC/Payflow, PayPal, Amazon Checkout, etc. All you really care about is the type, amount, and success; the additional details, like the different transaction IDs, vary by provider. This lets you use a single field for the "extra" details while normalizing the few bits that come back from different services in different shapes but represent the same kind of data.
Normalizing the above would mean at least 4 tables, with subqueries and joins that really aren't valuable at the database layer.
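Something along these lines (column and JSON field names are illustrative):

```sql
-- The bits every provider has in common become real columns;
-- everything provider-specific rides along in one JSON field.
CREATE TABLE payments (
  id       INTEGER PRIMARY KEY,
  provider TEXT NOT NULL,          -- 'payflow', 'paypal', 'amazon', ...
  amount   REAL NOT NULL,
  success  INTEGER NOT NULL,       -- 0 or 1
  details  TEXT                    -- provider-specific JSON
);

-- Reporting only touches the normalized columns...
SELECT provider, sum(amount)
FROM payments WHERE success = 1 GROUP BY provider;

-- ...but provider-specific details stay reachable when someone asks.
SELECT json_extract(details, '$.transaction_id')
FROM payments WHERE provider = 'paypal';
```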