JSON with Sqlite (sqlite.org)
473 points by ausjke 51 days ago | 68 comments



I've been using the JSON1 extension for some time now (in production projects) and it's truly remarkable. I usually just dump the JSON-response data from an API to a "raw_data" table (typically one "updated_at" column and a second "json_data" one).

At that point you can normalize your schema, but only if you really have to! You can get away with a NoSQL-like denormalized schema, performance-wise, by carefully defining indexes on expressions, and you can partially normalize it with views (SQLite doesn't support materialized views).
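To make the pattern concrete, here is a minimal sketch (the table, column and JSON field names are all invented, and it assumes a SQLite build with JSON1 compiled in):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a real file path in production
con.executescript("""
    CREATE TABLE raw_data (
        updated_at TEXT,
        json_data  TEXT
    );
    -- Expression index: queries filtering on $.user.id can use it directly.
    CREATE INDEX idx_user_id
        ON raw_data (json_extract(json_data, '$.user.id'));
    -- A plain (non-materialized) view gives the JSON blob a schema-like face.
    CREATE VIEW users AS
        SELECT json_extract(json_data, '$.user.id')   AS user_id,
               json_extract(json_data, '$.user.name') AS user_name,
               updated_at
        FROM raw_data;
""")
con.execute("INSERT INTO raw_data VALUES (?, ?)",
            ("2019-01-01", '{"user": {"id": 42, "name": "ada"}}'))
row = con.execute("SELECT user_name FROM users WHERE user_id = 42").fetchone()
print(row)  # ('ada',)
```

A WHERE on user_id expands to the same json_extract expression the index was built on, so the planner can use the index instead of scanning.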

And of course it's almost always faster to query data where it lives (via SQL) instead of fetching it from disk and querying it in, say, Python. You pay too much I/O cost otherwise. (Yes, my dear aspiring data scientist, do not load everything into a huge DataFrame; go learn yourself some SQL :-) )

The JSON dump is stored in text rather than binary format, but honestly I haven't seen this be a problem, plus you can easily run queries at the CLI and pipe the output to jq, sed, etc.

If your application is data warehouse-like and read-heavy (for example an internal reporting dashboard) I can't see any reason why you should pay the cost of setting up a Postgres or MongoDB instance (although I do love both.)

It is true that SQLite does not support concurrent writes, but (and that's a big BUT) if you carefully open connections only when you need them and use prepared statements, I can't see how you could run into problems with modern SSD hardware (unless you're Google-scale, of course).


> It is true that SQLite does not support concurrent writes, but (and that's a big BUT) if you carefully open connections only when you need them and use prepared statements, I can't see how you could run into problems with modern SSD hardware (unless you're Google-scale, of course).

WAL mode is your friend. (Various SQLite drivers, including the Python one, are however somewhat buggy in their transaction handling and need workarounds; essentially they delay the BEGIN of a transaction until you issue DML statements which obviously breaks snapshot isolation entirely).

WAL mode allows one writer at a time without impeding readers.

(See https://docs.sqlalchemy.org/en/latest/dialects/sqlite.html#p... for the pysqlite workaround)
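The workaround boils down to taking transaction control away from pysqlite (a sketch; the table name is made up):

```python
import os
import sqlite3
import tempfile

# WAL only matters for file-backed databases, so use a real file here.
db_path = os.path.join(tempfile.mkdtemp(), "app.db")

# isolation_level=None puts the driver in autocommit mode, so it stops
# delaying BEGIN; we issue BEGIN/COMMIT ourselves instead.
con = sqlite3.connect(db_path, isolation_level=None)
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]  # 'wal'

con.execute("CREATE TABLE items (name TEXT)")
con.execute("BEGIN")  # the snapshot starts here, not at the first DML statement
con.execute("INSERT INTO items VALUES (?)", ("widget",))
con.execute("COMMIT")
```

With the driver's implicit transaction handling out of the way, a long-running read transaction really does see a single consistent snapshot.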


I see your point, but I'm very hesitant to change any configuration of SQLite. It kinda feels like one step too far toward doing devops, which is what one wants to avoid by using SQLite, I guess.

Having said that, I do play around with PRAGMA statements when it's really needed, but tweaking the code usually works fine - even increasing the timeout is often enough :D


I read that first sentence like 3 times and still don't get it.


Often people use SQLite for sheer convenience rather than perceived performance or resource efficiency, i.e. because SQLite is zero-config and works out of the box. Sure, spinning up a Docker container with PostgreSQL is easy enough, but why do that when you can use the standard library with SQLite already embedded and linked?


SQLite is an alternative to fopen, not a full-blown DB...

And its performance and reliability are a huge part of why it runs on millions of devices everywhere.

Your browser was never going to embed a build of PostgreSQL and Docker.

Enabling WAL can hardly be called config, it’s a one liner in every driver I’ve ever seen.

It’s like saying specifying the file directory SQLite uses is config.


> It’s like saying specifying the file directory SQLite uses is config.

Or, to continue the fopen analogy, like saying that the file you've given fopen is to be opened with mode "a+". The only real difference is that fopen's mode argument is required whereas SQLite's mode has a useful default.

> Enabling WAL can hardly be called config, it’s a one liner in every driver I’ve ever seen.

Even in the raw libsqlite3 C API (as long as you don't need any error checking ;) ):

    sqlite3_exec(db, "PRAGMA journal_mode=WAL", NULL, NULL, NULL);
presumably because it's just a one-liner in SQLite's dialect.


> (Yes, my dear aspiring data scientist, do not load everything into a huge DataFrame; go learn yourself some SQL :-) )

This hits a little bit too close to home for me. I am quite proficient at writing performant SQL queries and recently started using Pandas. I find the data frame abstraction better for certain data manipulation tasks. Assuming there is enough RAM available, is it still better to offload everything to the database engine?


As usual: it depends... If you're prototyping in a Jupyter notebook, sure, go ahead and work at the Pandas level. Once you're done prototyping and have settled on a "final_df" (I bet you have something like that in your last notebook cells), maybe you should think about transforming some of those "columns" into SQL queries, which are VCS-able, sometimes faster, and, most importantly, usable by other people too. Instead of 10 people loading 10 different DFs, you can have 10 people querying the same table/view.


> Assuming there is enough RAM available

You will also have to assume you don't care much about the latency introduced by transferring everything over to your working process.

(And in batch data processing situations you typically don't care much.)


Dask has an out-of-memory dataframe implementation. Works great! I think it might support sql queries for that matter.


Dplyr works great with SQL (both SQLite and others).


> (Yes, my dear aspiring data scientist, do not load everything into a huge DataFrame; go learn yourself some SQL :-) )

You really don't even need to know SQL anymore to keep things out of memory. In R, the dplyr/dbplyr package has SQL translations so you can utilize the exact same syntax as you would on in-memory data frames and it will execute as SQL using the database as a backend.

Not saying people shouldn't learn SQL regardless, but even that shouldn't be an excuse for doing everything in-memory these days.


I recently wrote an SQLite extension that lets me query stored Protobuf messages. I was inspired by the JSON extension. Adding virtual tables to SQLite with an extension is tricky but kind of magical when you finally have the full query language to play with.

https://github.com/rgov/sqlite_protobuf


Do you have performance numbers, especially in comparison to json1? Wondering how much benefit you get from a binary encoding scheme.


I don’t, sorry. I did not build this with performance in mind, but for querying packet captures of a protocol built on Protobuf.

Any query is going to require deserializing every row in the table (JSON and Protobuf alike) which is basically a non-starter for any project with performance requirements. And I believe that Protobuf messages are decoded as a whole, rather than just the desired field, which is going to be slower.


Interesting library, thanks!

> Any query is going to require deserializing every row in the table (JSON and Protobuf alike) which is basically a non-starter for any project with performance requirements.

Apparently you can create indexes on the json functions, so I assume on your proto ones too.

> I believe that Protobuf messages are decoded as a whole, rather than just the desired field, which is going to be slower.

With the official library, yes, but it's an implementation/API choice. The wire format [1] can be skimmed fairly efficiently. Most significantly, message/byte/string fields all are written as tag number, length, data. So if there are submessages (entire trees) you don't care about, you can just skip over them. I've seen custom proto decoding things do this.
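A rough illustration of how cheap the skipping is (hand-rolled and simplified, not the official library; it only walks the top-level fields of a single message):

```python
def read_varint(buf, i):
    """Decode a base-128 varint starting at offset i; return (value, next offset)."""
    result = shift = 0
    while True:
        b = buf[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def find_field(buf, wanted):
    """Return the raw value of the first top-level field numbered `wanted`,
    skipping every other field without decoding it."""
    i = 0
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wire_type = key >> 3, key & 7
        if wire_type == 0:                      # varint
            value, i = read_varint(buf, i)
        elif wire_type == 1:                    # fixed 64-bit
            value, i = buf[i:i + 8], i + 8
        elif wire_type == 2:                    # length-delimited: string/bytes/submessage
            n, i = read_varint(buf, i)
            value, i = buf[i:i + n], i + n      # a whole submessage skips as one slice
        elif wire_type == 5:                    # fixed 32-bit
            value, i = buf[i:i + 4], i + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if field == wanted:
            return value
    return None

# field 1 = varint 150, field 2 = string "hi"
msg = b"\x08\x96\x01\x12\x02hi"
print(find_field(msg, 2))  # b'hi'
```

Because every length-delimited field carries its byte length up front, an unwanted subtree costs one varint read plus a pointer bump, regardless of how deep it is.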

I think the most efficient approach would be to parse the path once per SQL query into a tag number-based path. And then have the per-row function just follow those with custom decoding logic. Unfortunately from my quick skim, it looks like SQLite's extension API doesn't really support this. You'd want it to build some context object for a given path/proto, then call extract with it a bunch of times, then tear it down. Still, I suppose you could do a LRU cache of these or something.

btw, I see your code is compiling a regex [edit: originally wrote proto by mistake] in the per-row path. [2] I haven't profiled, but I'd bet that's slowing you down a fair bit. You could just make it a 'static const std::regex* kPathElementRegexp = new std::regex("...")' to avoid this. (static initialization is thread-safe in C++. The heap allocation is because it's good practice to ensure non-POD globals are never destructed. Alternatively, there's absl::NoDestructor for this.)

[1] https://developers.google.com/protocol-buffers/docs/encoding

[2] https://github.com/rgov/sqlite_protobuf/blob/0a148ac6a5a2c02...


You mean it's recompiling the regex on every pass? Ah, yeah, I hadn't thought about that. Thanks for the tip.


To check whether you already have it installed, run "pragma compile_options;" and look for ENABLE_JSON1. On my Mac it was present both in the sqlite3 binary in the terminal and in Python :-)

  import sqlite3
  con = sqlite3.connect(':memory:')
  for row in con.execute("pragma compile_options;"):
      print(row)  # look for ENABLE_JSON1
The newest releases of sqlite3 have it included. If not, you can build it this way: https://burrows.svbtle.com/build-sqlite-json1-extension-as-s... Note that you will only get this error on load if you already have it built in: 'Error: error during initialization:'


What are some useful use-cases for this?

I can already use JSON in Sqlite by doing parsing in the client outside of the Sqlite API. Any examples where I'd want to use this instead? I'm guessing for where clauses in queries, perhaps?

Can I create an index on a field within a JSON tuple?


What I have found with any relational database that has this kind of JSON storage option (MySQL, PostgreSQL) is that you can take data that arrives from a 3rd party as JSON and just dump it straight into the db as JSON, as opposed to having to create a schema for it.

It's nice to have the option for the data to still be queryable without having to make it a first class schema with all schema setup involved.

It isn't for 1st-class data that is used frequently. But when you might only need the data on occasion, it's nice to have, productivity-wise.


One pattern that I've used when parsing large json responses is to have a table schema like

    frequently|used|data|in|these|cols|full_json_response
So you pull out the fields you need frequently but keep the full JSON so you can dig into the details of the response on an ad-hoc basis. Not very efficient, since you can end up storing a big blob of JSON, but you're not throwing away any data when you parse the response and your schema stays very manageable. It also gives you the flexibility to add new fields to the schema later by pulling the data out of the JSON.
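A sketch of the pattern (all names invented): the promoted columns serve the frequent queries, and json_extract digs into the retained payload when needed.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE api_responses (
        order_id TEXT,              -- fields you query all the time...
        status   TEXT,
        full_json_response TEXT     -- ...plus the untouched original payload
    )
""")

resp = {"order": {"id": "A-17", "status": "shipped", "carrier": {"name": "acme"}}}
con.execute("INSERT INTO api_responses VALUES (?, ?, ?)",
            (resp["order"]["id"], resp["order"]["status"], json.dumps(resp)))

# Promoted columns handle the common queries; json_extract covers the rest.
row = con.execute("""
    SELECT status, json_extract(full_json_response, '$.order.carrier.name')
    FROM api_responses WHERE order_id = 'A-17'
""").fetchone()
print(row)  # ('shipped', 'acme')
```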


I use this pattern a lot, too. Just make sure not to `SELECT *`, which could be much slower and waste a lot of RAM in some cases (ORMs using e.g. active record pattern are often susceptible to this when used naively).


I've also done something similar where different systems carry different extra data, but the common bits are, well, common. For example, supporting direct CC, PayPal and Amazon Pay: putting the common parts of the transaction in separate columns, with the full details in a JSON column.


Completely agree, I do this all the time. This is a useful pattern.


It's nice for being stage 1 of an import: pull the JSON in and use queries to populate other columns/tables with the extracted portions you're working with since you have the full power of a SQL database for many common cleanup tasks.
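For instance, the json_each table-valued function makes the stage-2 extraction a plain INSERT ... SELECT (a sketch with made-up names, assuming JSON1 is available):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging (payload TEXT);
    CREATE TABLE tags (customer_id INTEGER, tag TEXT);
""")
con.execute("INSERT INTO staging VALUES (?)",
            ('{"customer_id": 7, "tags": ["vip", "emea"]}',))

# json_each explodes the array; ordinary SQL handles the rest of the cleanup.
con.execute("""
    INSERT INTO tags
    SELECT json_extract(s.payload, '$.customer_id'), je.value
    FROM staging AS s, json_each(s.payload, '$.tags') AS je
""")
rows = con.execute("SELECT * FROM tags ORDER BY tag").fetchall()
print(rows)  # [(7, 'emea'), (7, 'vip')]
```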


I work quite a lot with SQLite since it is the foundation of https://redisql.com/

The JSON module is really a godsend! It allows you to do things that would otherwise be extremely painful, difficult or unergonomic.

To take a few examples: here (http://redbeardlab.tech/rediSQL/blog/JaaS/) is a simple way to store JSON and manipulate it; it would have been impossible without the JSON module.

In this other example (http://redbeardlab.tech/rediSQL/blog/golang/using-redisql-wi...) I use the module to avoid computation on the client and leave the extraction of the value to SQLite, very convenient if you ask me.


RediSQL looks interesting and I like the pricing structure. I'm working on marketing my own set of products so I know it's difficult to keep it up. Good luck with this venture and I hope it works out for you and your team.


Great!

Please get in touch, there is my email in the profile! There is a small community of people that try to live with open source software.

If your product is somehow usable with RediSQL we could do marketing together or bundle our products; there is a lot of synergy in software that can be combined.


Since you can make effectively arbitrary index expressions[1], I'd expect the answer to be "yes". The article doesn't explicitly state if they're deterministic or not, but it seems like some of them could / should be.

[1]: https://www.sqlite.org/expridx.html


I gave it a shot and it appears to work as expected.

  sqlite> create table a (id int primary key, j json);
  sqlite> insert into a values (1, '{"hello":"world"}');
  sqlite> create index idx_a on a (json_extract(j, '$.hello'));
  sqlite> explain query plan select * from a where json_extract(j, '$.hello') = 'world';
  QUERY PLAN
  `--SEARCH TABLE a USING INDEX idx_a (<expr>=?)
  sqlite> explain query plan select * from a where json_extract(j, '$.foo') = 'world';
  QUERY PLAN
  `--SCAN TABLE a


I'm using this in a mobile application (Android and iOS). The data model can change server-side at any time, so the data is stored as JSON to avoid complicated client-side migrations.

Before this I loaded and filtered all the data in memory. Now, with json_extract, I can both index and filter the data using SQLite, which is a massive performance boost.


It's really common to store JSON documents inside rows in normal relational databases.

This can be because you actually have a JSON blob that is associated with the row, e.g. I have a script that scrapes some public registries of historic monuments (I have dull hobbies!) and it just stores the responses in a JSON column. It's convenient.

Another way these 'dynamic columns' are used is to flatten one-to-many relationships. For example, I have a database where account managers can add arbitrary tags to customers. The classic approach would be a customer table, then a 1:M into a tag table with the key value, and then another M:1 to go from the key id to the key names. Instead, I just have a JSON column with all the key values in it, right there in the customer record. It's convenient!


I've done this too.

Note that for databases that support array columns, this is much nicer to represent with an array. Postgres supports them, for example. (SQLite doesn't)


Is there a link for the support documentation?


Column arrays have been in PostgreSQL since at least 9.1.

https://www.postgresql.org/docs/9.1/arrays.html


One thing I like is when getting data from multiple tables.

Let's say you have a product with multiple labels: usually you'll get one line per (product, label) tuple, so you'll have to do some work application-side if you want to get one [product, labels] object per product.

With the json aggregate function you can get a (product, json array of labels) tuple per product and just have to do some json_decode in your application code.
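In SQLite's dialect the aggregate is json_group_array; a sketch (schema invented, JSON1 assumed):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE label (product_id INTEGER, value TEXT);
    INSERT INTO product VALUES (1, 'lamp');
    INSERT INTO label VALUES (1, 'sale'), (1, 'indoor');
""")

# One row per product, with the labels folded into a JSON array server-side.
name, labels_json = con.execute("""
    SELECT p.name, json_group_array(l.value)
    FROM product p JOIN label l ON l.product_id = p.id
    GROUP BY p.id
""").fetchone()
labels = json.loads(labels_json)  # back to a plain list on the client
```

One json.loads per product row replaces the usual loop that regroups (product, label) tuples by hand.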


> "Let's say you have a product with multiple labels: usually you'll get one line per (product, label) tuple so you'll have to do some job application side if you what to get one [product, labels] object per product."

What's the payoff of doing [product, labels], particularly if that means a JSON array of labels? I briefly investigated doing that for my application, but I found that it would increase disk space (not really a big problem) and also make efficient querying a big chore (e.g. labels->products queries.)

I can see this being a good tradeoff if querying by labels is extremely rare and instead you only ever want to query product->labels. But that's still pretty damn fast with the (product, label) tuple schema.


You don't have to change your schema to add some JSON field.

When querying you do something like

  select p.name, json_agg(l.value) as labels
  from product p
  join label l on l.product_id = p.id
  group by p.name


I use this for “pulling” data out of the db hierarchically, similar to something like GraphQL:

https://github.com/coast-framework/lighthouse/blob/master/RE...


There is the possibility of storing a binary-encoded json object, which uses far less space than regular json, and since the server is doing the encoding you aren't going to need custom client side libraries to decode it.

There is also the possibility of magic 'compression' by having the server automagically extract common schema elements of the stored data to probably cut the storage size in half again.


SQLite doesn't (currently) do either of these, though.


Pivots and trees. Turning long skinny tables into wide sparse tables used to be something we did a lot; now we return a JSON tree... we were always parsing the sparse table into a tree anyway.



I especially like the aggregate functions; they let you select an additional dimension of data.


From a glance, this looks like it's compatible with MySQL's JSON functions.

It isn't compatible, for example, with MariaDB's dynamic columns.

I'm not very familiar with the PostgreSQL json functionality, but I think that's subtly different again.

When I use local DBs for testing my JSON code I've used Derby and defined the json_extract() functions etc. myself just for testing. With SQLite having compatible functions, people wanting to test MySQL stuff locally will be able to just point their code at an SQLite DB instead. Great!

The bit that seems to be missing is the shorthand for selecting column values; in MySQL, instead of doing SELECT JSON_EXTRACT(col, "$.this.is.ugly[12]"), ... you can just do SELECT col->"$.this.is.ugly[12]", ...

Now what I want to be able to write is SELECT col.this.is.nicer[12], ....

I think there is some 'standard' somewhere that MySQL - and now SQLite - is implementing? The functions and the 'path' syntax are standardized (although with only MySQL and now SQLite supporting them it's perhaps not a big deal). I just can't find any reference to that standard in the MySQL docs, nor in this SQLite doc.

Personally, I dislike the path syntax though! Every time I see an sql snippet with string paths full of dollar signs it offends my retinas.


According to [1], ISO/IEC 9075:2016 specifies JSON functionality in ISO SQL, though I don't have access to it/didn't buy the standard text. I also don't know whether Postgres etc. is designed to follow the standard, as Postgres' implementation predates it, but would expect sqlite to.

[1]: https://modern-sql.com/blog/2017-06/whats-new-in-sql-2016


Postgres 11 doesn't follow the standard at all (as per your link). A third party implemented the standard [0] which was submitted on pghackers[1] very late in the Postgres 11 cycle[2] (at 75k gzipped the patch is pretty humongous). A quick googling doesn't show any information pertaining it being integrated (in part or full) into pg12 so IDK.

edit: apparently https://commitfest.postgresql.org/17/1471/ has been "Waiting on Author" and bumped from CF to CF since early 2018, and https://commitfest.postgresql.org/17/1472/ and https://commitfest.postgresql.org/17/1473/ pretty much the same with no "waiting on author" but I don't really know how CF works and I see no comment or requests or reviews so…

[0] https://obartunov.livejournal.com/200076.html

[1] https://www.postgresql.org/message-id/CAF4Au4w2x-5LTnN_bxky-...

[2] https://www.postgresql.org/message-id/00531c7e-f501-b852-9b6...


They're actively working on it. There have been new revisions posted over the last few days, responding to review from a week or two ago.

https://www.postgresql.org/message-id/a3be6a7a-77d3-0e88-4f9... https://www.postgresql.org/message-id/c2f32c9f-9a69-202b-a8a...

> I see no comment or requests or reviews so…

That happens on the mailing list...


There's a technical report ISO/IEC TR 19075-6:2017 corresponding to the standard, freely available from [1]. It's a much more accessible version of the standard if you're not looking to implement it.

1. https://standards.iso.org/ittf/PubliclyAvailableStandards/


SQLite is really an awesome database, this seems great to store arbitrary json data!


> Backwards compatibility constraints mean that SQLite is only able to store values that are NULL, integers, floating-point numbers, text, and BLOBs. It is not possible to add a sixth "JSON" type.

I'd be interested to know what these constraints are? Does SQLite guarantee that files created with newer SQLite versions are still compatible with older SQLite versions?


From https://www.sqlite.org/onefile.html :

> The SQLite database file format is also stable. All releases of SQLite version 3 can read and write database files created by the very first SQLite 3 release (version 3.0.0) going back to 2004-06-18. This is "backwards compatibility". The developers promise to maintain backwards compatibility of the database file format for all future releases of SQLite 3. "Forwards compatibility" means that older releases of SQLite can also read and write databases created by newer releases. SQLite is usually, but not completely forwards compatible.


And yet, "The json1 extension uses the sqlite3_value_subtype() and sqlite3_result_subtype() interfaces that were introduced with SQLite version 3.9.0 (2015-10-14) The json1 extension will not work in earlier versions of SQLite."


This just means you can't use the json1 extension with an old version of the library. But anything you write using it will still be readable with an older version.

Of course, if you do something like creating an index using a json1 expression, that will cause issues if you try to use the database with an older version.


This should be built into standard SQLite.


SQLite is used in a LOT of environments, including low-overhead embedded systems. This means it's impractical to add the overhead for those systems. As to desktop and mobile, I absolutely agree, it should be the default in a lot of places.


The documentation specifies that that is a possible end state:

> The "1" at the end of the name for the json1 extension is deliberate. The designers anticipate that there will be future incompatible JSON extensions building upon the lessons learned from json1. Once sufficient experience is gained, some kind of JSON extension might be folded into the SQLite core. For now, JSON support remains an extension.


Is loading extensions at runtime new? Coulda sworn I used to have to recompile for this


It seems valuable enough to have in the standard build. That and the CSV extension as well.


Especially for things like Python's embedded SQLite. Packaging and loading an extension module cross-platform with an otherwise pure Python script seems difficult.


Pretty sure it's been around for at least 10 years or so. (But it's not a widely used feature, I think.)


JSON1 is compiled into every Linux distro's SQLite I tested and is also in Mac's SQLite, as well as the Windows binaries provided by sqlite.org. The only exception I found was the SQLite bundled with Python for Windows, which doesn't seem to have any extensions at all.

Performance-wise this is fast enough that you can do fairly complex un-indexed queries (so full table scans) against tables with a few ten-thousand to hundred-thousand rows and have them complete in a couple tens of msecs.


The JSON1 extension can be used in Android with the sqlite-android library

https://github.com/requery/sqlite-android


One of the nicest things you can do with this is get hierarchical results.


Why not convert JSON from/to a BLOB?


From the article:

The json1 extension does not (currently) support a binary encoding of JSON. Experiments have been unable to find a binary encoding that is significantly smaller or faster than a plain text encoding. (The present implementation parses JSON text at over 300 MB/s.) All json1 functions currently throw an error if any of their arguments are BLOBs because BLOBs are reserved for a future enhancement in which BLOBs will store the binary encoding for JSON.


Seems like it has been updated to say 1GB/s now.



